When crawlers like edX-downloader make requests against courseware, they often load many units in the same sequence concurrently. This causes contention for the rows in courseware_studentmodule that store the student's state for various XBlocks/XModules -- most notably the sequence, chapter, and course, all of which record and update user position information when loaded.

It would be nice to remove these writes altogether and come up with a cleaner way of keeping track of the user's position; in general, GETs should be side-effect free. However, any such change would break backwards compatibility, and would require close coordination with research teams to make sure they weren't negatively affected.

This commit identifies crawlers by user agent (the CrawlersConfig model) and blocks student state writes when a crawler is detected: FieldDataCache writes simply become no-ops. It doesn't alter the rendering of the courseware in any way -- the main impact is that the blocks won't record your most recent position, which is meaningless for crawlers anyway.

This can also be used as a building block for other policies we want to define around crawlers. We just have to be mindful that this only works with "nice" crawlers that are honest in their user agents, and that significantly more sophisticated (and costly) measures would be necessary to prevent crawlers that try to be even trivially sneaky. [PERF-403]
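The write-blocking behavior described above can be sketched in isolation. This is an illustrative model, not the actual FieldDataCache code: `is_crawler` is stubbed here (in the commit it is `CrawlersConfig.is_crawler`, which reads from the database-backed config), and `StudentStateStore` / `save_position` are hypothetical names standing in for the courseware_studentmodule write path.

```python
def is_crawler(user_agent, known_agents=("edX-downloader",)):
    """Prefix-match so version bumps like 'edX-downloader/2.0' still match."""
    return bool(user_agent) and any(user_agent.startswith(a) for a in known_agents)


class StudentStateStore:
    """Hypothetical stand-in for the courseware_studentmodule write path."""

    def __init__(self):
        self.writes = []

    def save_position(self, user_agent, position):
        # The commit's key idea: when the request looks like a crawler,
        # the state write silently becomes a no-op.
        if is_crawler(user_agent):
            return
        self.writes.append(position)


store = StudentStateStore()
store.save_position("edX-downloader/2.0", "unit_3")  # dropped: crawler
store.save_position("Mozilla/5.0", "unit_3")         # persisted: real user
print(store.writes)  # ['unit_3']
```

Because the guard sits at the write path rather than the request path, rendering is unaffected -- only the side effect is suppressed.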
46 lines
1.5 KiB
Python
"""
|
|
This module handles the detection of crawlers, so that we can handle them
|
|
appropriately in other parts of the code.
|
|
"""
|
|
from django.db import models
|
|
|
|
from config_models.models import ConfigurationModel
|
|
|
|
|
|
class CrawlersConfig(ConfigurationModel):
|
|
"""Configuration for the crawlers django app."""
|
|
known_user_agents = models.TextField(
|
|
blank=True,
|
|
help_text="A comma-separated list of known crawler user agents.",
|
|
default='edX-downloader',
|
|
)
|
|
|
|
def __unicode__(self):
|
|
return u'CrawlersConfig("{}")'.format(self.known_user_agents)
|
|
|
|
@classmethod
|
|
def is_crawler(cls, request):
|
|
"""Determine if the request came from a crawler or not.
|
|
|
|
This method is simplistic and only looks at the user agent header at the
|
|
moment, but could later be improved to be smarter about detection.
|
|
"""
|
|
current = cls.current()
|
|
if not current.enabled:
|
|
return False
|
|
|
|
req_user_agent = request.META.get('HTTP_USER_AGENT')
|
|
crawler_agents = current.known_user_agents.split(",")
|
|
|
|
# If there was no user agent detected or no crawler agents configured,
|
|
# then just return False.
|
|
if (not req_user_agent) or (not crawler_agents):
|
|
return False
|
|
|
|
# We perform prefix matching of the crawler agent here so that we don't
|
|
# have to worry about version bumps.
|
|
return any(
|
|
req_user_agent.startswith(crawler_agent)
|
|
for crawler_agent in crawler_agents
|
|
)
|
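The matching logic in `is_crawler` can be exercised without Django. The sketch below mirrors the method's comma-split and prefix-match steps under the assumption that the config is enabled; `matches_crawler` is an illustrative name, not part of the commit.

```python
def matches_crawler(req_user_agent, known_user_agents):
    """Mirror of CrawlersConfig.is_crawler's matching, minus the config lookup."""
    if not req_user_agent or not known_user_agents:
        return False
    crawler_agents = known_user_agents.split(",")
    # Prefix matching means 'edX-downloader/2.0' still matches after a
    # version bump, without having to update the configured list.
    return any(
        req_user_agent.startswith(crawler_agent)
        for crawler_agent in crawler_agents
    )


print(matches_crawler("edX-downloader/2.0", "edX-downloader"))  # True
print(matches_crawler("Mozilla/5.0 (X11; Linux)", "edX-downloader"))  # False
```

One caveat of the comma-split approach: entries are matched verbatim, so a configured list like `"edX-downloader, other-bot"` would carry a leading space into the second entry and fail to match `other-bot/1.0`.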