From 52dceb172d6fe762bb60b670df61866fe86b6f17 Mon Sep 17 00:00:00 2001 From: Russell Bryant Date: Tue, 3 Jun 2025 21:09:13 -0400 Subject: [PATCH] [Docs] Add developer doc about CI failures (#18782) Signed-off-by: Russell Bryant Co-authored-by: Mark McLoughlin Co-authored-by: Cyrus Leung --- docs/contributing/ci-failures.md | 120 +++++++++++++++++++++++++++++++ 1 file changed, 120 insertions(+) create mode 100644 docs/contributing/ci-failures.md diff --git a/docs/contributing/ci-failures.md b/docs/contributing/ci-failures.md new file mode 100644 index 000000000000..4d8f78197f33 --- /dev/null +++ b/docs/contributing/ci-failures.md @@ -0,0 +1,120 @@ +# CI Failures + +What should I do when a CI job fails on my PR, but I don't think my PR caused +the failure? + +- Check the dashboard of current CI test failures: + ๐Ÿ‘‰ [CI Failures Dashboard](https://github.com/orgs/vllm-project/projects/20) + +- If your failure **is already listed**, it's likely unrelated to your PR. + Help fixing it is always welcome! + - Leave comments with links to additional instances of the failure. + - React with a ๐Ÿ‘ to signal how many are affected. + +- If your failure **is not listed**, you should **file an issue**. + +## Filing a CI Test Failure Issue + +- **File a bug report:** + ๐Ÿ‘‰ [New CI Failure Report](https://github.com/vllm-project/vllm/issues/new?template=450-ci-failure.yml) + +- **Use this title format:** + + ``` + [CI Failure]: failing-test-job - regex/matching/failing:test + ``` + +- **For the environment field:** + + ``` + Still failing on main as of commit abcdef123 + ``` + +- **In the description, include failing tests:** + + ``` + FAILED failing/test.py:failing_test1 - Failure description + FAILED failing/test.py:failing_test2 - Failure description + https://github.com/orgs/vllm-project/projects/20 + https://github.com/vllm-project/vllm/issues/new?template=400-bug-report.yml + FAILED failing/test.py:failing_test3 - Failure description + ``` + +- **Attach logs** (collapsible section example): +
+ Logs: + + ```text + ERROR 05-20 03:26:38 [dump_input.py:68] Dumping input data + --- Logging error --- + Traceback (most recent call last): + File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 203, in execute_model + return self.model_executor.execute_model(scheduler_output) + ... + FAILED failing/test.py:failing_test1 - Failure description + FAILED failing/test.py:failing_test2 - Failure description + FAILED failing/test.py:failing_test3 - Failure description + ``` + +
+ +## Logs Wrangling + +Download the full log file from Buildkite locally. + +Strip timestamps and colorization: + +```bash +# Strip timestamps +sed -i 's/^\[[0-9]\{4\}-[0-9]\{2\}-[0-9]\{2\}T[0-9]\{2\}:[0-9]\{2\}:[0-9]\{2\}Z\] //' ci.log + +# Strip colorization +sed -i -r 's/\x1B\[[0-9;]*[mK]//g' ci.log +``` + +Use a tool for quick copy-pasting: + +```bash +tail -525 ci_build.log | wl-copy +``` + +## Investigating a CI Test Failure + +1. Go to ๐Ÿ‘‰ [Buildkite main branch](https://buildkite.com/vllm/ci/builds?branch=main) +2. Bisect to find the first build that shows the issue. +3. Add your findings to the GitHub issue. +4. If you find a strong candidate PR, mention it in the issue and ping contributors. + +## Reproducing a Failure + +CI test failures may be flaky. Use a bash loop to run repeatedly: + +```bash +COUNT=1; while pytest -sv tests/v1/engine/test_engine_core_client.py::test_kv_cache_events[True-tcp]; do + COUNT=$[$COUNT + 1]; echo "RUN NUMBER ${COUNT}"; +done +``` + +## Submitting a PR + +If you submit a PR to fix a CI failure: + +- Link the PR to the issue: + Add `Closes #12345` to the PR description. +- Add the `ci-failure` label: + This helps track it in the [CI Failures GitHub Project](https://github.com/orgs/vllm-project/projects/20). + +## Other Resources + +- ๐Ÿ” [Test Reliability on `main`](https://buildkite.com/organizations/vllm/analytics/suites/ci-1/tests?branch=main&order=ASC&sort_by=reliability) +- ๐Ÿงช [Latest Buildkite CI Runs](https://buildkite.com/vllm/ci/builds?branch=main) + +## Daily Triage + +Use [Buildkite analytics (2-day view)](https://buildkite.com/organizations/vllm/analytics/suites/ci-1/tests?branch=main&period=2days) to: + +- Identify recent test failures **on `main`**. +- Exclude legitimate test failures on PRs. +- (Optional) Ignore tests with 0% reliability. + +Compare to the [CI Failures Dashboard](https://github.com/orgs/vllm-project/projects/20).