[pmwiki-users] Core Spam Block Thoughts
Crisses
crisses at ofobscurity.com
Wed Apr 19 17:44:49 CDT 2006
On Apr 5, 2006, at 3:57 PM, Chris Lott wrote:
> It would be REALLY nice if the blocklist script were enhanced a bit to
> a) have an option for blocking only whole words, 2) be able to block
> using regular expressions, 3) have overrides so that a field could
> override a block in the sitewide config.
1(a) it's possible, but difficult, since a "whole word" can start or
end in something other than a space character. The reason this is
difficult is outlined in #2 below. Essentially it would require
using regexes. The partial workaround is to use whitespace, but that
will only catch some instances of the word, not all of them.
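To illustrate why the whitespace workaround only catches some
instances, here is a minimal sketch. It is in Python purely for
illustration (PmWiki itself is PHP), and the function names and the
sample term are made up for the example, not taken from the
blocklist script.

```python
import re

def whitespace_match(term, text):
    # Workaround: pad the term with spaces and do a plain substring test.
    return f" {term} " in text

def whole_word_match(term, text):
    # Regex version: \b matches wherever a word character meets a
    # non-word character (or the start/end of the text).
    return re.search(r"\b" + re.escape(term) + r"\b", text) is not None

for sample in ["buy casino chips", "casino!", "(casino)", "casinos are fun"]:
    print(sample, whitespace_match("casino", sample),
          whole_word_match("casino", sample))
```

The padded-space test only fires on the first sample. The regex test
also catches "casino!" and "(casino)", and still ignores "casinos",
which is the whole-word behaviour being asked for.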
2) regex -- while I could POSSIBLY see having an additional syntax
like "regex:^word$" as an alternative to plain string blocking, I
don't recommend this be done for the whole blocklist. The reason is
one of server resources. I have a blocklist that is enormous and
EXCEPTIONALLY effective. Using regex instead of string matching for
every entry would slow my server to a crawl, because a regex match is
far more taxing than a simple string match. If people want an option
to turn scoring off, then we could stop matching the moment there's a
positive match. I like scoring my matches because it makes it much
easier to pick out the worst offenders, and easier to spot the
possible false positives (people honestly trying to post who were
blocked).
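The difference between scoring and stopping at the first match can be
sketched like this. Again this is illustrative Python, not the
blocklist script's actual code; the function names and the idea that
the score is a simple count of matching terms are assumptions.

```python
def score_post(text, terms):
    """Scoring: test every term and count the hits (plain substring tests)."""
    return sum(1 for term in terms if term in text)

def is_blocked_fast(text, terms):
    """No scoring: any() stops at the first term that matches."""
    return any(term in text for term in terms)

terms = ["casino", "poker", "pills"]
print(score_post("cheap pills and poker", terms))       # 2 terms hit
print(is_blocked_fast("cheap pills and poker", terms))  # True
```

With scoring, a post that hits one term looks very different from a
post that hits forty, which is what makes borderline false positives
easier to spot. Without it, the loop can quit early on spam but still
has to walk the whole list for every clean post.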
3) It's possible to add an unblocklist or to have markup in a
blocklist page that unblocks a term. The problem is that now you
have to change the parse process --
a) first you parse the list(s) and remove "block:" from the entries
-- now you have a list of what you're looking to block
b) pull the IP addresses out of the list, they are compared differently
c) check IP
d) now you parse the unblock list and remove items from the block
entries that are in the unblock entries
e) now you compare the list items one at a time with every word
posted on the page
f) parse regular expressions through the post
You're asking to add steps d and f. If the blocklist is long, in step d
it has to do a needle-in-haystack search through every item in the
blocklist. Step e already takes a long time if the post is long and
the list is long. Step f has the potential to take even longer --
because regex parsing is enormously more complex for the server
processes to handle. If the post is long it could grind the server
to a halt -- YSMV (your server may vary).
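Steps a through f could be sketched roughly as follows. This is
illustrative Python under several assumptions, not the real PmWiki
code: the "unblock:" prefix, the function names, the rough IPv4
pattern, and the exact-match IP check are all hypothetical, and step e
is shown as a plain substring test.

```python
import re

# Rough IPv4 shape only; assumed for the example.
IP_PATTERN = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")

def parse_list(lines, prefix):
    """Keep entries that start with the prefix, with the prefix removed."""
    return [line[len(prefix):].strip()
            for line in lines if line.startswith(prefix)]

def check_post(text, poster_ip, blocklist_lines, unblock_lines, regexes=()):
    entries = parse_list(blocklist_lines, "block:")         # step a
    ips = {e for e in entries if IP_PATTERN.match(e)}       # step b
    terms = [e for e in entries if e not in ips]
    if poster_ip in ips:                                    # step c
        return True
    unblocked = set(parse_list(unblock_lines, "unblock:"))  # step d
    terms = [t for t in terms if t not in unblocked]
    if any(t in text for t in terms):                       # step e
        return True
    return any(re.search(p, text) for p in regexes)         # step f

blocklist = ["block:casino", "block:lithium", "block:10.0.0.1"]
unblock = ["unblock:lithium"]
print(check_post("I was put on lithium", "1.2.3.4", blocklist, unblock))
print(check_post("visit my casino", "1.2.3.4", blocklist, unblock))
```

Holding the unblock entries in a set keeps step d from being a search
through the whole blocklist for every entry, but steps e and f still
grow with both the length of the list and the length of the post.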
This may not be a huge problem for people with servers on steroids,
but I would like to avoid the complexity. If a word is a problem on
a farm, then I'd suggest moving the term to the fields that need to
block it. Most words aren't like that, but if I had a medical field
in my farm, I would be in trouble. As it is, I had to make a
decision to remove common psychoactive medications from the
Kinhost.org blocklist so that users could discuss being put on
lithium, etc.
However, as a rule, I don't see an issue with adding the regex
functionality, with warnings and cautions that excessive (mis)use
of it where a simple string would do is not recommended.
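A mixed list, where only entries marked "regex:" pay the regex cost,
might look like the sketch below. It is illustrative Python, not an
existing PmWiki feature; the "regex:" prefix follows the syntax
floated in point 2, and the function names are made up.

```python
import re

def compile_entries(entries):
    """Split entries into plain strings and precompiled regex patterns."""
    plain, patterns = [], []
    for entry in entries:
        if entry.startswith("regex:"):
            patterns.append(re.compile(entry[len("regex:"):]))
        else:
            plain.append(entry)
    return plain, patterns

def matches(text, plain, patterns):
    # Try the cheap substring tests first; use regexes only if none hit.
    if any(term in text for term in plain):
        return True
    return any(p.search(text) for p in patterns)

plain, patterns = compile_entries(["casino", r"regex:\bpills?\b"])
print(matches("cheap pill here", plain, patterns))  # regex entry hits
print(matches("spills", plain, patterns))           # no whole-word hit
```

Keeping the regex entries to a handful, and running them only after
the plain strings have failed to match, limits the extra load to the
cases that really need it.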
I know of one PmWiki installation that died -- the owner believes the
blocklist was the reason. Indeed, getting the blocklist page to come
up was one of the major problems.
Another caution -- the page history on your blocklist can become
very large. You should probably have the history purged frequently if
you are maintaining a large blocklist.
Crisses
--
Six hours in a car with two anime freaks - hopefully I'll survive
with my hair the same colour.
-Malcolm&