
Activity

02 Dec 2022

Issue Comment

Pmenichelli

Python error when running own scraper

I have made the documentation site for the company I work for using Docusaurus. I integrated Algolia's DocSearch with it, and I'm getting some strange Python errors when our CI runs the scraper.

These errors do not happen consistently; sometimes the CI runs the scraper successfully. The website we scrape is https://docs.surfly.com, and the config file I'm using for the crawler is the following:

{
  "index_name": "surfly-docs",
  "start_urls": [
    "https://docs.surfly.com/"
  ],
  "sitemap_urls": [
    "https://docs.surfly.com/sitemap.xml"
  ],
  "sitemap_alternate_links": true,
  "stop_urls": [
    "/tests"
  ],
  "selectors": {
    "lvl0": {
      "selector": "(//ul[contains(@class,'menu__list')]//a[contains(@class, 'menu__link menu__link--sublist menu__link--active')]/text() | //nav[contains(@class, 'navbar')]//a[contains(@class, 'navbar__link--active')]/text())[last()]",
      "type": "xpath",
      "global": true,
      "default_value": "Documentation"
    },
    "lvl1": "header h1",
    "lvl2": "article h2",
    "lvl3": "article h3",
    "lvl4": "article h4",
    "lvl5": "article h5, article td:first-child",
    "lvl6": "article h6",
    "text": "article p, article li, article td:last-child"
  },
  "strip_chars": " .,;:#",
  "custom_settings": {
    "separatorsToIndex": "_",
    "attributesForFaceting": [
      "language",
      "version",
      "type",
      "docusaurus_tag"
    ],
    "attributesToRetrieve": [
      "hierarchy",
      "content",
      "anchor",
      "url",
      "url_without_anchor",
      "type"
    ]
  },
  "conversation_id": [
    "833762294"
  ],
  "nb_hits": 46250
} 

If anyone has any tips on what could be going wrong, it would really help me. So far, googling these errors hasn't helped much: the stack traces don't reference any of the scraper's source files, and they look more like Python internals errors, so I end up looking at random results on Google.

Here are the stack traces when the CI fails running the job.

The command is:

podman run --rm --env-file=.env -e "CONFIG=$(cat ./docsearch-config.json | jq -r tostring)" algolia/docsearch-scraper

Error stack traces:

[       2ms] > Running command: podman run --rm --env-file=.env -e "CONFIG=$(cat ./docsearch-config.json | jq -r tostring)" algolia/docsearch-scraper
[     396ms] Traceback (most recent call last):
[     396ms]   File "/usr/local/bin/pipenv", line 7, in <module>
[     396ms]     from pipenv import cli
[     396ms]   File "/usr/local/lib/python3.6/dist-packages/pipenv/__init__.py", line 22, in <module>
[     396ms]     from pipenv.vendor.urllib3.exceptions import DependencyWarning
[     396ms]   File "/usr/local/lib/python3.6/dist-packages/pipenv/vendor/urllib3/__init__.py", line 7, in <module>
[     396ms]     import logging
[     396ms]   File "/usr/lib/python3.6/logging/__init__.py", line 28, in <module>
[     396ms]     from string import Template
[     396ms]   File "/usr/lib/python3.6/string.py", line 77, in <module>
[     396ms]     class Template(metaclass=_TemplateMetaclass):
[     396ms]   File "/usr/lib/python3.6/string.py", line 74, in __init__
[     396ms]     cls.pattern = _re.compile(pattern, cls.flags | _re.VERBOSE)
[     396ms]   File "/usr/lib/python3.6/re.py", line 233, in compile
[     396ms]     return _compile(pattern, flags)
[     396ms]   File "/usr/lib/python3.6/re.py", line 301, in _compile
[     396ms]     p = sre_compile.compile(pattern, flags)
[     396ms]   File "/usr/lib/python3.6/sre_compile.py", line 562, in compile
[     396ms]     p = sre_parse.parse(p, flags)
[     396ms]   File "/usr/lib/python3.6/sre_parse.py", line 855, in parse
[     396ms]     p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, 0)
[     396ms]   File "/usr/lib/python3.6/sre_parse.py", line 416, in _parse_sub
[     396ms]     not nested and not items))
[     396ms]   File "/usr/lib/python3.6/sre_parse.py", line 765, in _parse
[     396ms]     p = _parse_sub(source, state, sub_verbose, nested + 1)
[     396ms]   File "/usr/lib/python3.6/sre_parse.py", line 416, in _parse_sub
[     396ms]     not nested and not items))
[     396ms]   File "/usr/lib/python3.6/sre_parse.py", line 765, in _parse
[     396ms]     p = _parse_sub(source, state, sub_verbose, nested + 1)
[     396ms]   File "/usr/lib/python3.6/sre_parse.py", line 416, in _parse_sub
[     396ms]     not nested and not items))
[     396ms]   File "/usr/lib/python3.6/sre_parse.py", line 764, in _parse
[     396ms]     not (del_flags & SRE_FLAG_VERBOSE))
[     396ms] TypeError: unsupported operand type(s) for &: 'tuple' and 'int'
[     703ms] > Exit code: 1 
[       3ms] > Running command: podman run --rm --env-file=.env -e "CONFIG=$(cat ./docsearch-config.json | jq -r tostring)" algolia/docsearch-scraper
[     389ms] XXX lineno: 774, opcode: 163
[     391ms] Traceback (most recent call last):
[     391ms]   File "/usr/local/bin/pipenv", line 7, in <module>
[     391ms]     from pipenv import cli
[     391ms]   File "/usr/local/lib/python3.6/dist-packages/pipenv/__init__.py", line 22, in <module>
[     391ms]     from pipenv.vendor.urllib3.exceptions import DependencyWarning
[     391ms]   File "/usr/local/lib/python3.6/dist-packages/pipenv/vendor/urllib3/__init__.py", line 7, in <module>
[     391ms]     import logging
[     391ms]   File "/usr/lib/python3.6/logging/__init__.py", line 26, in <module>
[     391ms]     import sys, os, time, io, traceback, warnings, weakref, collections
[     391ms]   File "/usr/lib/python3.6/traceback.py", line 5, in <module>
[     391ms]     import linecache
[     391ms]   File "/usr/lib/python3.6/linecache.py", line 11, in <module>
[     391ms]     import tokenize
[     391ms]   File "/usr/lib/python3.6/tokenize.py", line 37, in <module>
[     391ms]     cookie_re = re.compile(r'^[ \t\f]*#.*?coding[:=][ \t]*([-\w.]+)', re.ASCII)
[     391ms]   File "/usr/lib/python3.6/re.py", line 233, in compile
[     391ms]     return _compile(pattern, flags)
[     391ms]   File "/usr/lib/python3.6/re.py", line 301, in _compile
[     391ms]     p = sre_compile.compile(pattern, flags)
[     391ms]   File "/usr/lib/python3.6/sre_compile.py", line 562, in compile
[     391ms]     p = sre_parse.parse(p, flags)
[     391ms]   File "/usr/lib/python3.6/sre_parse.py", line 855, in parse
[     391ms]     p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, 0)
[     391ms]   File "/usr/lib/python3.6/sre_parse.py", line 416, in _parse_sub
[     391ms]     not nested and not items))
[     391ms]   File "/usr/lib/python3.6/sre_parse.py", line 774, in _parse
[     391ms]     subpatternappend((AT, AT_BEGINNING))
[     391ms] SystemError: unknown opcode
[     709ms] > Exit code: 1 
[       2ms] > Running command: podman run --rm --env-file=.env -e "CONFIG=$(cat ./docsearch-config.json | jq -r tostring)" algolia/docsearch-scraper
[     677ms] Traceback (most recent call last):
[     677ms]   File "/usr/local/bin/pipenv", line 11, in <module>
[     677ms]     sys.exit(cli())
[     677ms]   File "/usr/local/lib/python3.6/dist-packages/pipenv/vendor/click/core.py", line 829, in __call__
[     677ms]     return self.main(*args, **kwargs)
[     677ms]   File "/usr/local/lib/python3.6/dist-packages/pipenv/vendor/click/core.py", line 781, in main
[     677ms]     with self.make_context(prog_name, args, **extra) as ctx:
[     677ms]   File "/usr/local/lib/python3.6/dist-packages/pipenv/vendor/click/core.py", line 700, in make_context
[     677ms]     self.parse_args(ctx, args)
[     677ms]   File "/usr/local/lib/python3.6/dist-packages/pipenv/vendor/click/core.py", line 1212, in parse_args
[     678ms]     rest = Command.parse_args(self, ctx, args)
[     678ms]   File "/usr/local/lib/python3.6/dist-packages/pipenv/vendor/click/core.py", line 1044, in parse_args
[     678ms]     parser = self.make_parser(ctx)
[     678ms]   File "/usr/local/lib/python3.6/dist-packages/pipenv/vendor/click/core.py", line 965, in make_parser
[     678ms]     for param in self.get_params(ctx):
[     678ms]   File "/usr/local/lib/python3.6/dist-packages/pipenv/vendor/click/core.py", line 912, in get_params
[     678ms]     help_option = self.get_help_option(ctx)
[     678ms]   File "/usr/local/lib/python3.6/dist-packages/pipenv/cli/options.py", line 27, in get_help_option
[     678ms]     from ..core import format_help
[     678ms]   File "/usr/local/lib/python3.6/dist-packages/pipenv/core.py", line 33, in <module>
[     678ms]     from .project import Project
[     678ms]   File "/usr/local/lib/python3.6/dist-packages/pipenv/project.py", line 30, in <module>
[     679ms]     from .vendor.requirementslib.models.utils import get_default_pyproject_backend
[     679ms]   File "/usr/local/lib/python3.6/dist-packages/pipenv/vendor/requirementslib/__init__.py", line 9, in <module>
[     679ms]     from .models.lockfile import Lockfile
[     679ms]   File "/usr/local/lib/python3.6/dist-packages/pipenv/vendor/requirementslib/models/lockfile.py", line 9, in <module>
[     680ms]     import plette.lockfiles
[     680ms]   File "/usr/local/lib/python3.6/dist-packages/pipenv/vendor/plette/__init__.py", line 8, in <module>
[     680ms]     from .lockfiles import Lockfile
[     680ms]   File "/usr/local/lib/python3.6/dist-packages/pipenv/vendor/plette/lockfiles.py", line 13, in <module>
[     680ms]     from .models import DataView, Meta, PackageCollection
[     680ms]   File "/usr/local/lib/python3.6/dist-packages/pipenv/vendor/plette/models/__init__.py", line 8, in <module>
[     680ms]     from .base import (
[     680ms]   File "/usr/local/lib/python3.6/dist-packages/pipenv/vendor/plette/models/base.py", line 2, in <module>
[     680ms]     import cerberus
[     680ms]   File "/usr/local/lib/python3.6/dist-packages/pipenv/vendor/cerberus/__init__.py", line 21, in <module>
[     680ms]     __version__ = get_distribution("Cerberus").version
[     680ms]   File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 472, in get_distribution
[     680ms]     dist = get_provider(dist)
[     680ms]   File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 344, in get_provider
[     680ms]     return working_set.find(moduleOrReq) or require(str(moduleOrReq))[0]
[     680ms]   File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 892, in require
[     681ms]     needed = self.resolve(parse_requirements(requirements))
[     681ms]   File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 765, in resolve
[     681ms]     env = Environment(self.entries)
[     681ms]   File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 976, in __init__
[     681ms]     self.scan(search_path)
[     681ms] AttributeError: 'Environment' object has no attribute 'scan'
[    1.096s] > Exit code: 1 
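In case it helps narrow things down, here is a quick sketch I use to rule out a malformed config before the container runs. This is my own check, not part of docsearch-scraper, and the required-key list is an assumption on my part; a broken file would otherwise only surface as an opaque error inside the image.

```python
import json
import sys


def validate_config(path: str) -> dict:
    """Parse the crawler config, failing fast on a broken file."""
    with open(path) as f:
        config = json.load(f)  # raises json.JSONDecodeError on invalid JSON
    # Keys I believe the scraper relies on; adjust if the docs say otherwise.
    for key in ("index_name", "start_urls", "selectors"):
        if key not in config:
            sys.exit(f"config is missing required key: {key}")
    return config
```

If this passes locally and in CI, the intermittent failures are unlikely to come from the config file itself.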

Forked On 02 Dec 2022 at 02:50:13

Pmenichelli

Or, if there's a way to run the crawler in a more verbose mode, I could get a clue about what fails.

Commented On 02 Dec 2022 at 02:50:13

Pmenichelli

started

Started On 29 Sep 2022 at 06:39:49