Deploying a ComfyUI workflow on an Oracle Linux server

Log in to the server, check the machine's overall configuration, and install Python 3.12 and git

# Log in
ssh -i C:\Users\xxx\.ssh\ssh-key-2025-05-12.key opc@xxx.xxx.xx.xxx
Activate the web console with: systemctl enable --now cockpit.socket
Last login: Mon May 12 12:57:35 2025 from 98.96.223.133
# Show operating system information
[opc@instance-20250512-xxx ~]$ uname -a
Linux instance-20250512-xxx 5.15.0-306.177.4.el8uek.x86_64 #2 SMP Wed Feb 19 10:29:16 PST 2025 x86_64 x86_64 x86_64 GNU/Linux
# For comparison, the same command on the Ubuntu machine on AWS prints:
Linux @facemosaic 5.15.0-1063-aws #69~20.04.1-Ubuntu SMP Fri May 10 19:20:12 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
# This is an Oracle machine, so check the Oracle Linux release
cat /etc/oracle-release # prints: Oracle Linux Server release 8.10. The newest Python version this release supports is 3.12
# On Ubuntu, check the OS release like this
ubuntu@facemosaic:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 20.04.6 LTS
Release:        20.04
Codename:       focal
# Check the time
date
Fri May 16 07:38:24 GMT 2025 # This cloud server's clock runs on UTC, 8 hours behind UTC+8; the AWS machine is set to China time
# Check the NVIDIA GPU model
[opc@instance-20250512-xxx ~]$ lspci | grep -i nvidia
00:04.0 3D controller: NVIDIA Corporation GA102GL [A10] (rev a1)
00:1e.0 3D controller: NVIDIA Corporation Device 2237 (rev a1)
# Show GPU details
 nvidia-smi
Tue Jun  3 15:42:55 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A10G                    On  | 00000000:00:1E.0 Off |                    0 |
|  0%   27C    P0              58W / 300W |   5726MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A     50436      C   /usr/bin/python                            5716MiB |
+---------------------------------------------------------------------------------------+
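# Check the number of CPU cores
cat /proc/cpuinfo | grep "cpu cores" | uniq
# Check RAM, in GiB
free -g
# Check disk space
df -h
# Check the Python version
python3 --version
# The system Python is 3.6, so a newer version is needed. Reference: https://docs.oracle.com/cd/F22978_01/python/OL8-PYTHONG.pdf
sudo dnf install python3.12
python3.12 --version # prints: Python 3.12.8
# Point python3 at 3.12 (3.6 stays installed; do not remove it)
sudo alternatives --set python3 /usr/bin/python3.12 # Optional on the Ubuntu server, as long as the virtual environment is created with python3.12
# Install pip. One option is the pip that matches the default python3: sudo dnf install python3-pip python3-setuptools python3-wheel
# The other is to name the minor version and install the pip built for Python 3.12
sudo dnf install python3.12-pip python3.12-setuptools python3.12-wheel
# Register pip as an alternative, so that the plain pip command runs pip3.12 from now on
sudo update-alternatives --install /usr/bin/pip pip /usr/bin/pip3.12 1
# On Ubuntu, install pip with these commands instead
curl -O https://bootstrap.pypa.io/get-pip.py
python3.12 get-pip.py
python3.12 -m pip --version # confirm that pip is available for Python 3.12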

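# Install git (removing it also needs the sudo prefix: sudo dnf remove git)
sudo dnf install git -y
# Create the ComfyUI project directory and enter it (-p also creates ~/local if it is missing)
mkdir -p ~/local/comfy
cd ~/local/comfy
# Download the ComfyUI framework code
git clone https://github.com/comfyanonymous/ComfyUI.git
# Once ComfyUI is downloaded, run the following inside ComfyUI/custom_nodes to get the node manager:
# git clone https://github.com/Comfy-Org/ComfyUI-Manager.git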

Set up a Python virtual environment

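# List the files in the home directory after login; it contains no files or folders
ls
# Create the directory; -p creates any missing parent directories
mkdir -p ~/local/python/env
# Enter the directory dedicated to virtual environments
cd ~/local/python/env
# Create a virtual environment named comfy_try_on here. Because of the alternatives setting above, python3 now resolves to Python 3.12
python3 -m venv comfy_try_on
# Activate the environment. To leave it, type deactivate; to delete it, remove its directory: rm -r comfy_try_on
source comfy_try_on/bin/activate
# To make it easy to activate a virtual environment from any directory, add the following steps
# Create the bin directory and enter it
mkdir -p ~/bin
cd ~/bin
# Create a script file named activate (it lives in your own home directory, so sudo is not needed)
touch activate
# Make the file executable
chmod ug+x activate
# Write the activation script into the file. The text contains the placeholder $1, so the EOF on the first line is quoted to keep the shell from expanding it
tee ~/bin/activate <<'EOF'
#!/bin/bash
source ~/local/python/env/$1/bin/activate
EOF
# To empty the file, run tee ~/bin/activate > /dev/null and press Ctrl + D to end the input
# Add the bin directory to PATH (use $HOME, because a ~ inside double quotes is stored literally instead of being expanded)
export PATH="$HOME/bin:$PATH"
# From now on, "source activate <environment name>" starts a virtual environment from any directory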
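
To see why the EOF delimiter is quoted, the sketch below writes the same script body twice into a temporary directory (a stand-in for ~/bin, so nothing real is touched): once with a quoted delimiter and once unquoted. The temporary directory, the file name activate_unquoted and the value expanded_too_early are made up for the demonstration.

```shell
#!/bin/bash
# Demo only: a temporary directory stands in for ~/bin.
demo_dir=$(mktemp -d)

# Quoted delimiter: the body is written verbatim, so $1 survives as a placeholder.
cat > "$demo_dir/activate" <<'EOF'
#!/bin/bash
source "$HOME/local/python/env/$1/bin/activate"
EOF

# Unquoted delimiter: the writing shell expands $HOME and $1 immediately.
set -- expanded_too_early          # hypothetical value of $1 in the writing shell
cat > "$demo_dir/activate_unquoted" <<EOF
source "$HOME/local/python/env/$1/bin/activate"
EOF

# Count the lines that still contain a literal $1.
grep -c '\$1' "$demo_dir/activate"                    # prints 1
grep -c '\$1' "$demo_dir/activate_unquoted" || true   # prints 0
```

With the unquoted form, the script on disk would always point at whatever $1 happened to be when the file was written, which is why the quoted form is the one to use.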

Start the service

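# Enter the ComfyUI directory and activate the Python virtual environment there
cd ~/local/comfy/ComfyUI
source activate comfy_try_on
# Install PyTorch inside the virtual environment, so that it is decoupled from any system-wide PyTorch
pip install torch torchvision torchaudio
# To pin a specific torch version, use one of these instead:
pip install torch==2.6.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126
pip install torch==2.5.1 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
# Install dependencies. On this first install, these are the requirements of the ComfyUI framework and its default nodes
pip install -r requirements.txt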
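# Open the port in the server firewall. To close several ports at once: sudo firewall-cmd --zone=public --remove-port=7860/tcp --remove-port=7861/tcp --permanent
sudo firewall-cmd --permanent --add-port=8188/tcp
# Reload the firewall so that the rule takes effect
sudo firewall-cmd --reload
# Check that the port is open
sudo firewall-cmd --zone=public --list-ports
# A port is reachable only when three conditions hold:
# 1. The port is open in the server's local firewall 2. The port is open in the security rules in the cloud provider's console 3. An HTTP service on the machine is listening on the port
# Start ComfyUI. The options set the listen address and port; without them the default is http://127.0.0.1:8188
#python3 main.py --listen=192.168.98.xx --port=98xx
# To allow both local access and remote access over the internet, use the command below. Reference: [Why does my python TCP server need to bind to 0.0.0.0 and not localhost or it's IP address](https://stackoverflow.com/questions/38256851/why-does-my-python-tcp-server-need-to-bind-to-0-0-0-0-and-not-localhost-or-its)
python main.py --listen 0.0.0.0
# When startup succeeds, the command line shows
Starting server
To see the GUI go to: http://0.0.0.0:8188
# The page is then reachable from the internet at IP:8188. If it is not, use this site to double-check that the port is open: https://tool.chinaz.com/port
# To stop the service, use ComfyUI Manager on the web page or press Ctrl+C. When running in cmd on a local machine, closing the cmd window also works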
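
The reachability check can also be done from a shell instead of a web tool. The sketch below is one possible helper, not part of ComfyUI: the function name port_open is made up, and it relies on bash's /dev/tcp pseudo-device and on timeout from GNU coreutils. Run on the server against 127.0.0.1 it only tells you whether something is listening (condition 3); run from another machine against the public IP it exercises all three conditions at once.

```shell
#!/bin/bash
# port_open HOST PORT: succeed if a TCP connection can be opened within 2 seconds.
# Uses bash's /dev/tcp redirection; host and port are passed as $0 and $1
# to the inner bash so they are never re-parsed as code.
port_open() {
    timeout 2 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# 8188 is the ComfyUI port opened above; replace 127.0.0.1 with the server's
# public IP when testing from another machine.
if port_open 127.0.0.1 8188; then
    echo "port 8188 accepts connections"
else
    echo "port 8188 is not reachable"
fi
```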

Add a watchdog for the ComfyUI process

# Create the watchdog's launch script
nano run_comfyui.sh
# The script contains the following code
   #!/bin/bash
   while true
   do
       echo "Starting ComfyUI..."
       python main.py --listen 0.0.0.0
       echo "ComfyUI crashed or exited. Restarting in 3 seconds..."
       sleep 3
   done
# Make the script executable
chmod +x run_comfyui.sh
# Run it as a background process
nohup ./run_comfyui.sh > comfyui.log 2>&1 &
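# What each part of the command does
nohup
Purpose: makes the process started after it ignore the hangup signal (SIGHUP), so the process is not terminated when you close the terminal.
Typical use: running long jobs on a remote server, so they are not killed when the SSH session drops.
./run_comfyui.sh
Purpose: runs the run_comfyui.sh script in the current directory.
Note: the script needs execute permission (chmod +x run_comfyui.sh).
> comfyui.log
Purpose: redirects standard output (stdout) to the file comfyui.log.
Effect: the script's normal output (echo, print and so on) is written to comfyui.log instead of the terminal.
2>&1
Purpose: redirects standard error (stderr) to wherever standard output (stdout) currently points.
Effect: all error messages are also written to comfyui.log instead of the terminal.
Details: 2 is the file descriptor for stderr and 1 is the file descriptor for stdout.
&
Purpose: runs the whole command in the background, so the terminal stays free for other commands.
Effect: you can close the terminal or keep working in it while the script continues to run in the background.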
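
The order of the two redirections matters, which is easy to get wrong. As a small demonstration (the helper emit and the temporary log files are made up for this sketch), compare the order used above with the reversed order:

```shell
#!/bin/bash
# emit writes one line to stdout and one line to stderr.
emit() {
    echo "to stdout"
    echo "to stderr" >&2
}

log_dir=$(mktemp -d)

# Order used in this guide: stdout is pointed at the file first,
# then stderr is pointed at the same place.
emit > "$log_dir/both.log" 2>&1

# Reversed order: stderr is duplicated from the original stdout before
# stdout moves to the file, so only the stdout line reaches the file.
emit 2>&1 > "$log_dir/stdout_only.log"

grep -c . "$log_dir/both.log"          # prints 2
grep -c . "$log_dir/stdout_only.log"   # prints 1
```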
# A Gradio app can write its log to a file in the same way, so the log is still available if the SSH connection drops
python gradio_tryon.py > gradio.log 2>&1
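
The restart loop in run_comfyui.sh restarts ComfyUI forever, even when it fails immediately on every start. A bounded variant is sketched below as one possible alternative, not as what this setup uses: the function name run_with_restarts and its two parameters are made up, and unlike the original loop it stops supervising when the command exits cleanly.

```shell
#!/bin/bash
# run_with_restarts MAX DELAY CMD...: run CMD; if it fails, restart it up to
# MAX times, waiting DELAY seconds between attempts.
run_with_restarts() {
    local max_restarts=$1 delay=$2 attempt=0 status
    shift 2
    while true; do
        attempt=$((attempt + 1))
        if "$@"; then
            return 0                      # clean exit: stop supervising
        else
            status=$?                     # exit code of the failed command
        fi
        if [ "$attempt" -gt "$max_restarts" ]; then
            echo "giving up after $attempt attempts (last exit code $status)" >&2
            return "$status"
        fi
        echo "exited with code $status, restarting in ${delay}s..." >&2
        sleep "$delay"
    done
}

# Hypothetical usage inside run_comfyui.sh, in place of the endless loop:
# run_with_restarts 5 3 python main.py --listen 0.0.0.0
```

With MAX set to 5, a command that always fails is run six times in total (one initial attempt plus five restarts) before the function returns its last exit code.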

Set up daily log rotation

cd /etc/logrotate.d/
sudo nano comfyui
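# Enter the following text. The trailing comments are explanations only and must be left out of the actual file
/home/opc/local/comfy/ComfyUI/comfyui.log {
    daily                  # rotate once a day
    rotate 7               # keep at most 7 rotated logs
    compress               # gzip the rotated logs
    missingok              # no error if the log file is missing
    notifempty             # do not rotate an empty log
    copytruncate           # copy the log, then truncate the original (suits logs written by background processes such as nohup)
    dateext                # add a date suffix to rotated file names
    su opc opc             # rotate as user and group opc, so logrotate does not refuse the operation because of permissions
}
# Force a log rotation
sudo logrotate -f /etc/logrotate.d/comfyui
# /home/opc/local/comfy/ComfyUI/ now contains a file like comfyui.log-20250530.gz; view its contents with zcat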
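
Since the explanatory comments must not end up in the real file, it can be safer to generate the config from a heredoc than to type it in nano. The sketch below writes a comment-free copy of the same settings to a temporary file and checks it; the temporary file is an assumption of this example, and the two sudo commands are left commented out because they need root on the server.

```shell
#!/bin/bash
# Write the config without inline comments into a temporary file first.
conf=$(mktemp)
cat > "$conf" <<'EOF'
/home/opc/local/comfy/ComfyUI/comfyui.log {
    daily
    rotate 7
    compress
    missingok
    notifempty
    copytruncate
    dateext
    su opc opc
}
EOF

# Confirm that no comment characters slipped in.
if grep -q '#' "$conf"; then
    echo "config still contains comments" >&2
else
    echo "config is comment-free: $conf"
fi

# Then, on the server:
# sudo cp "$conf" /etc/logrotate.d/comfyui
# sudo logrotate -d /etc/logrotate.d/comfyui   # -d: debug mode, changes nothing
```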

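Import and install nodes and models
After importing the workflow on the web page, click the Manager button, then click Custom Nodes Manager, and install the missing nodes from that page in one click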

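When a workflow is imported, a dialog pops up offering to install the missing nodes. Do not accept it; close the dialog instead. In several attempts, installing through that dialog often made the ComfyUI process exit. After closing it, click the Manager button at the top and download the missing nodes from there; the process gets interrupted less often that way.
After importing the workflow and installing the missing nodes, clicking Run may still produce many errors, such as missing model dependencies or library version conflicts. As long as the run succeeds in the end, the errors printed in the terminal can be ignored. If it cannot run, use the web pop-up messages together with the terminal log to resolve the errors.
If a node or model is missing, the Manager can download it in one click; if a node throws errors, the Manager's node fix feature is also worth trying. In practice, though, the "models in workflow" list shown by the model downloader includes some models the workflow does not need, so do not download everything in one click; pick only the ones you need.
Several models had to be installed manually this time. Some of them were downloaded from these sources:
The article "fluxgym installation tutorial: start training models in just a few steps!", and https://huggingface.co/comfyanonymous
https://huggingface.co/hfmaster/models/tree/main/flux
You can also download a model locally and upload it to the cloud server: log in over SSH in one terminal, then run this command in another terminal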

scp -v -i path/to/your/private_key path/to/your/local/file username@remoteIP:path/to/your/target/file 
# The matching command for downloading a file
# scp -v -i path/to/your/private_key username@remoteIP:path/to/your/remote/file path/to/your/local/target/file

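If a library such as cv2 is missing, install it in the Python virtual environment rather than in the server's system Python. Different workflows may depend on different library versions, so the library versions of each workflow need to be maintained independently.
One special case is a dependency on glibc 2.34. glibc is a core Linux system library and upgrading it in place would make the system unstable, so, following the article below, glibc 2.34 was compiled locally and wired in with patchelf.
Installing multiple GLIBC versions on the same Linux machine
The final commands were: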

cd /home/opc/local/python/env/comfy_try_on/lib/python3.12/site-packages/bitsandbytes
patchelf --add-rpath /opt/glibc234/lib libbitsandbytes_cuda126.so

But this only links the .so against glibc 2.34; at run time Python still loads glibc 2.28 first. The next two commands make Python use glibc 2.34 as well:

sudo /home/opc/local/python/env/comfy_try_on/bin/patchelf --set-interpreter /opt/glibc234/lib/ld-linux-x86-64.so.2 /home/opc/local/python/env/comfy_try_on/bin/python3
LD_LIBRARY_PATH=/opt/glibc234/lib:$LD_LIBRARY_PATH /home/opc/local/python/env/comfy_try_on/bin/python3 -m bitsandbytes

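With that, Python also uses the glibc 2.34 libraries.
But then gcc reported errors: glibc is a core system library, and with glibc 2.34 in use, calls to gcc turned out to be incompatible with it.
At that point I gave up on glibc 2.34 and switched to the older bitsandbytes 0.42.0.
After a plain pip install, python -m bitsandbytes failed with a message asking for a build from source. Following the steps it suggested: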

git clone https://github.com/TimDettmers/bitsandbytes.git
cd bitsandbytes
CUDA_VERSION=126
python setup.py install

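However, because the checked-out commit is version 0.42.0
git checkout 4870580f17767a4165ec05d954dff2b5c25b694d
the install fails because the library libbitsandbytes_cuda126.so does not exist.
So after cd bitsandbytes, first run make CUDA_VERSION=126 to build the .so.
The last remaining error was
ModuleNotFoundError: No module named 'triton.ops'
The triton library was in fact installed. The cause was that the old bitsandbytes needs an old triton, so triton was downgraded to 2.3.1 as well.
After that, only warnings remained: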

torch 2.7.0 requires triton==3.3.0; ... but you have triton 2.3.1 which is incompatible.
torchscale 0.3.0 requires timm==0.6.13, but you have timm 1.0.15 which is incompatible.

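In other words, PyTorch 2.7.0 officially requires triton==3.3.0. With triton downgraded to 2.3.1, some PyTorch features (such as torch.compile and its Inductor backend) may not get triton acceleration, but most basic functionality is unaffected.
To see which packages depend on a given library:
pipdeptree --reverse --packages protobuf
Then running the workflow failed again with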

ValueError: Calling `to()` is not supported for `4-bit` quantized models with the installed version of bitsandbytes. The current device is `cuda:0`. If you intended to move the model, please install bitsandbytes >= 0.43.2.

The code that raises the error is

python3.12/site-packages/accelerate/big_modeling.py
elif version.parse(importlib.metadata.version("bitsandbytes")) < version.parse("0.43.2"):
    raise ValueError(
        "Calling `to()` is not supported for `4-bit` quantized models with the installed version of bitsandbytes. "
        f"The current device is `{self.device}`. If you intended to move the model, please install bitsandbytes >= 0.43.2."
    )
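
The excerpt boils down to a version comparison against 0.43.2. To check an installed version against that threshold from the shell before running a workflow, something like the sketch below can help. The function name version_ge is made up, and GNU sort -V only approximates Python's version ordering (it does not understand pre-release tags), so treat the result as a quick check rather than an exact replica of the accelerate logic.

```shell
#!/bin/bash
# version_ge A B: succeed when version A >= version B under GNU sort -V ordering.
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

required="0.43.2"     # threshold from the accelerate check
installed="0.42.0"    # the bitsandbytes version installed in this guide

if version_ge "$installed" "$required"; then
    echo "bitsandbytes $installed: moving 4-bit models with to() is allowed"
else
    echo "bitsandbytes $installed < $required: to() on 4-bit models raises ValueError"
fi
```

On the server, the installed value could be read with pip show bitsandbytes instead of being hard-coded.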

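My first thought for a fix was to replace JoyCaption2 or bitsandbytes with other nodes that do similar work. But while looking at JoyCaption2's model list, I found that it contains both of these

[Screenshot: JoyCaption2 model list]

My guess is that the first, default model is an 8-bit model quantized down to 4-bit for speed, so I tried the second, 8-bit-only version. The workflow then ran to completion, though probably two or three seconds slower than it would with 4-bit.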
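Current state of the workflow:

Two accelerators are now in place: the older bitsandbytes 0.42.0 and HyperL-FLUX-Accelerator. An outfit swap takes about 80 seconds, still behind the 30 seconds of the competitor Creati, so I will keep looking for other ways to speed it up.
The output is not right yet, probably because of the parameter settings; this needs debugging.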

The second run hit OOM, and clicking the button that releases GPU memory did not help; the cause of the OOM still needs debugging.

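Data backup
rclone can back up files from the Linux server to Google Drive, and can download them as well;
gdown can also be used for downloads (wget only works for small files)
The rclone configuration steps, for reference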

rclone config
# n) New remote
# name> gdrive
# Storage> drive
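# client_id> (may be left empty)
# client_secret> (may be left empty)
# scope> 1 (full access)
# root_folder_id> (may be left empty)
# service_account_file> (may be left empty)
# Edit advanced config? n
# Use auto config? n
# Copy the URL shown into a browser on your local computer, log in and authorize
# Paste the resulting code back into the server terminal
# Configuration complete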

# It is better not to leave the client ID and secret empty. To get your own, follow these steps:
1. Log in to the Google Cloud Console.
2. Create a new project (or select an existing project).
3. Enable the Google Drive API:
   In the left menu, go to “APIs & Services” → “Library”.
   Search for “Google Drive API”, click on it, and enable it.
4. Create OAuth 2.0 credentials:
   Go to “APIs & Services” → “Credentials”.
   Click “Create Credentials” → “OAuth client ID”.
   For application type, select “Desktop app”.
   Enter a name and click “Create”.
5. Obtain the client_id and client_secret:
   After creation, the client_id and client_secret will be displayed. Copy them.
6. Enter them during rclone config:
   When configuring the Google Drive remote in rclone, paste the client_id and client_secret you just obtained.
# If you are told to set up the consent screen first, follow these steps
Before you can create OAuth client credentials in Google Cloud Console, you need to set up the OAuth consent screen for your project:
1. Go to the Google Cloud Console: https://console.cloud.google.com/
2. In the left menu, select “APIs & Services” → “OAuth consent screen”.
3. Configure the consent screen:
   Choose “External” (recommended for most use cases) or “Internal” (if only for users in your organization).
   Fill in the required fields (App name, User support email, Developer contact information, etc.).
   You do not need to add any scopes or test users for basic rclone use (unless you want to restrict access).
   Save and continue through the steps.
4. After the consent screen is configured, return to “Credentials” and create your OAuth client ID as described before.

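Then use this command to copy the models folder from the server to Google Drive (hjm is the name of the rclone remote used here; the source comes first, the destination second)
rclone copy /home/ubuntu/local/ComfyUI/models/ hjm:'server-local/comfy/ComfyUI/models/' -P --ignore-existing
# If some files turn out not to have been uploaded, run the copy again:
Skip files that already exist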
(comfy_try_on) [opc@instance-20250512-junmin ComfyUI]$ rclone listremotes
hjm:
(comfy_try_on) [opc@instance-20250512-junmin ComfyUI]$ rclone copy /home/opc/local/comfy/ComfyUI/models/diffusion_models hjm:'server-local/comfy/ComfyUI/models/diffusion_models' -P --ignore-existing
Or transfer only files that are newer than the copy on the destination
rclone copy /home/opc/local/comfy/ComfyUI/models/diffusion_models hjm:'server-local/comfy/ComfyUI/models/diffusion_models' -P --update
Or force every file to be transferred again, overwriting the old copies
rclone copy /home/opc/local/comfy/ComfyUI/models/diffusion_models hjm:'server-local/comfy/ComfyUI/models/diffusion_models' -P --ignore-times

Troubleshooting

If starting main.py fails with
OSError: [Errno 98] error while attempting to bind on address ('127.0.0.1', 8188): [errno 98] address already in use
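first run netstat -tuplen to see which service is using the port. If it is a previous ComfyUI process that did not exit, kill it: kill -9 PID
After restarting ComfyUI and loading a workflow, even a very simple workflow may report OOM, because the models loaded and the VRAM used by the previous run have not been released yet. So call the VRAM clearing interface both before shutting ComfyUI down and before loading a workflow after it starts.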
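
One way to call that interface from the shell is sketched below. It assumes that the installed ComfyUI version exposes a POST /free route accepting the unload_models and free_memory flags; verify the route and both field names against the server.py of your installation before relying on it. The address and the function name free_vram are also assumptions of this example.

```shell
#!/bin/bash
# Assumed address of the running ComfyUI instance; override via the environment.
COMFY_URL="${COMFY_URL:-http://127.0.0.1:8188}"

# free_vram: ask ComfyUI to unload models and release cached GPU memory.
# Assumes the server exposes POST /free (check your version's server.py).
free_vram() {
    curl -sS -X POST "$COMFY_URL/free" \
        -H 'Content-Type: application/json' \
        -d '{"unload_models": true, "free_memory": true}'
}

# Usage (requires a running server), followed by a check of the GPU memory in use:
# free_vram && nvidia-smi --query-gpu=memory.used --format=csv
```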
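Installing new nodes can cause all kinds of conflicts, so before installing a node, first run
pipdeptree | grep mediapipe to record the current version of the library in question (pip freeze saves the versions of all installed Python libraries)
Then, while the node is being installed, watch the log and note which libraries it installs
That way, if a conflict shows up later, its cause can be analyzed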
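
A simple way to do this bookkeeping is to take a pip freeze snapshot before and after installing a node and compare the two. The sketch below shows the comparison step only; the function name new_or_changed is made up, and the package versions are invented sample data standing in for real pip freeze output.

```shell
#!/bin/bash
# new_or_changed BEFORE AFTER: print the lines of AFTER that are not in BEFORE,
# i.e. packages that were added or whose pinned version changed.
new_or_changed() {
    grep -vxFf "$1" "$2"
}

# On the server the two files would come from:
#   pip freeze > before.txt    # before installing the node
#   pip freeze > after.txt     # after installing it
# Here made-up sample data stands in for the real output.
before=$(mktemp); after=$(mktemp)
printf '%s\n' 'numpy==1.26.4' 'protobuf==4.25.3' > "$before"
printf '%s\n' 'mediapipe==0.10.14' 'numpy==1.26.4' 'protobuf==3.20.3' > "$after"

new_or_changed "$before" "$after"
# mediapipe==0.10.14
# protobuf==3.20.3
```

Each printed line is either a newly installed package or a package whose version the node installation changed, which narrows down where a later conflict may come from.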
