[pmwiki-users] Trouble with .pageindex when too much _new_ data to index (+ sqlite)

ABClf languefrancaise at gmail.com
Wed Jan 28 17:45:44 CST 2015


Pierre, Petko, thank you for the helpful answers.
I'm going to investigate and play more with the $PageIndexTime variable,
although I feel the issue I'm facing happens before indexing even starts.

As of now, my reading is: after I import a large amount of new data (with the
sqlite recipe), PmWiki fails to start indexing; something runs out of memory
before indexing begins, and after that it's dead. Please note PmWiki still
works fine for rendering pages, for linking, etc., no matter how many pages
there are.
This failure happened to me several times this evening, and the only
workaround was to limit the amount of newly imported data (do not import 60 MB
of new data in one go, but import 6 MB and do it 10 times).
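In case it helps to see what I mean by importing in smaller batches, here is a
rough sketch (not my real script; the file names and chunk size are only
placeholders) that splits a big dump of INSERT statements into smaller files,
so each import adds fewer new pages at once:

--------------sketch--------------------
<?php
# Rough sketch only (quotes.sql and the chunk size are placeholders).
# Naive split: assumes each statement ends with ";" at the end of a line
# and that no quoted text contains that sequence.
$statements = array_filter(array_map('trim',
    explode(";\n", file_get_contents('quotes.sql'))));

$chunkSize = 9000;   # roughly 6 MB of my data instead of 60 MB at once
foreach (array_chunk($statements, $chunkSize) as $i => $chunk) {
    file_put_contents(sprintf('quotes-part-%02d.sql', $i + 1),
        implode(";\n", $chunk) . ";\n");
}
--------------sketchend-----------------

Between two imports, I run searches until the indexing catches up, then import
the next part.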

(Pierre)
> As far as I remember, PmWiki doesn't check to see if the amount of new
> data is acceptable to index.  PmWiki simply has a list of pages that it
> knows aren't in the index, indexes as many as it can in $PageIndexTime
> seconds, and leaves the rest for a later request to index.


What if PmWiki suddenly has 10k new pages to index? It may take time just to
list all of these new pages and get ready to index them in the next step. In
my test case the new pages are very short, but there are a lot of them (90000
quotes = 90000 pages ;)). How would I debug a bottleneck, if there is one, in
that very early stage of the indexing?
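What I can do for now, I suppose, is log the peak memory and the elapsed time
from config.php, to see how far a search request gets before it dies.
Something like this (the log file name is just a placeholder):

--------------sketch--------------------
# In config.php: when the request ends (even after a fatal error, usually),
# append the peak memory usage and the elapsed time to a log file.
$request_start = microtime(true);
register_shutdown_function(function () use ($request_start) {
    $line = sprintf("%s peak=%.1f MB time=%.1f s\n",
        date('H:i:s'),
        memory_get_peak_usage(true) / 1048576,
        microtime(true) - $request_start);
    file_put_contents('D:/xampp3/htdocs/abclf/index-debug.log',
        $line, FILE_APPEND);
});
--------------sketchend-----------------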

(Petko)
> There is a 10-second default limit for the indexing work; that is, if
> there are more pages that haven't been indexed, they will be dropped and
> will be indexed on the next search.


Indeed. Yet in my case, when importing too much data, indexing doesn't even
start, because it fails in the prior step.

Tested just now:
I renamed .pageindex, .flock and .lastmod in wiki.d, so that nothing is left
besides the 100 MB sqlite database, and I ran a search in the browser.
PmWiki has to index an SQL database made of 100000 new pages ;) I see my
hard drive LED blinking fast, I feel the temperature going up (really, some
processing is happening), and finally an error message is printed on the
browser screen: Fatal error: Maximum execution time of 30 seconds exceeded
in D:\xampp3\htdocs\abclf\scripts\xlpage-utf-8.php on line 75

Checking in wiki.d: a new 0-byte .flock file has been created. No new
.pageindex.

I run the search again. The hard drive spins up again, and the error message
is: Fatal error: Maximum execution time of 30 seconds exceeded in
D:\xampp3\htdocs\abclf\pmwiki.php on line 2015

No change in the wiki.d folder.

One last run, and the last error message is: Fatal error: Maximum execution
time of 30 seconds exceeded in D:\xampp3\htdocs\abclf\cookbook\sqlite.php on
line 403

Please note the hard drive looks quite stressed. I experienced 2 forced
reboots on my laptop (which doesn't like to be stressed).
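I suppose the 30 seconds come from PHP's max_execution_time in my XAMPP setup,
so before the next test I will also try raising the PHP limits, either in
php.ini or from config.php. Something like this (the values are only what I
intend to try):

--------------sketch--------------------
# In config.php, near the top: raise PHP's own limits for this wiki only.
ini_set('max_execution_time', '300');   # instead of the default 30 seconds
ini_set('memory_limit', '256M');
--------------sketchend-----------------

That is in line with the SystemLimits recipe Petko mentions below.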

Of course I can share my 100 MB sqlite database for further testing.
In case you are interested, in the Quote group I'm inserting 90000 records of
this kind:

--------------insert--------------------
INSERT INTO pages (author, charset, name, targets, text)
VALUES ("gb", "UTF-8", "Citation.95463", "Bob.3580,Source.3957",
"[[#citation]]
Marcault crut voir le Corse blémir. –Il nous colle la trouille, ce gars-là,
reconnut Filippi.
[[#citation_]]

bob:3580
source:3957");
--------------insertend-----------------

For the Word group, here is the kind of data (70000 records):

--------------insert--------------------
INSERT INTO pages (author, charset, ctime, targets, name, title, text)
VALUES ("gb", "UTF-8", "1401924338", "Synonyme.81","Bob.73244","qu'est-ce
que vous voulez que ça me fasse ?","(:title qu'est-ce que vous voulez que ça
me fasse ?:)

[[#fpa]]
vedette: qu'est-ce que vous voulez que ça me fasse ?
variantes:
[[#fpa_]]

[[#fr]]
Indifférence
[[#fr_]]

[[#etymologie]]

[[#etymologie_]]

[[#tlfi]]

[[#tlfi_]]

[[#traduction]]

[[#traduction_]]

[[#remarque]]

[[#remarque_]]

[[#attestation]]

[[#attestation_]]
attestation_c:
grammaire:
synonyme: 81
morphologie:
usage:
famille:
registre_origine:
registre_actuel: 5");
--------------insertend-----------------

I don't know if PmWiki is the most relevant tool for this kind of job, and I
know I'm not good enough to use it at its best. And I feel a little ashamed
to show my messy kitchen ;) Let's say I'm testing. I know PmWiki cannot
replace an SQL database, yet it has a lot of power to give.


Gilles.



2015-01-28 23:16 GMT+01:00 Petko Yotov <5ko at 5ko.fr>:

> On 2015-01-28 22:10, ABClf wrote:
>
>> The main issue encountered is how .pageindex handles its indexing
>> task. It sounds like it definitely stops working when the amount of
>> _new_ data is too big.
>> I mean, the process looks like it first evaluates the amount of new
>> data rather than starting to index; thus, in case there is too much
>> new data, you get a memory error message and the game is over. I wish
>> the page indexing would work, and work, and work, no matter how much
>> new data there is to index, until it's done.
>>
>> If the amount of new data is acceptable, then it will start making the
>> index. Not in one go: you will have to ask it several times, but
>> at the end (search 10 times, more or less), you know it's done, and you
>> have not encountered a memory issue.
>>
>
> This is done by the function PageIndexUpdate() in scripts/pagelist.php.
>
> There is a 10-second default limit for the indexing work; that is, if
> there are more pages that haven't been indexed, they will be dropped and
> will be indexed on the next search.
>
> While pages are indexed, there shouldn't be a huge need for memory. After
> the terms of a page are compacted, they are written into the
> ".pageindex,new" file and dropped from memory (actually the values are
> replaced). Same for the next pages, up to 10 seconds. After that, the
> contents of the old ".pageindex" file are copied to ".pageindex,new" and
> then ".pageindex,new" is renamed to ".pageindex", replacing the old file.
> None of these operations should require a lot of memory.
>
> The only place I see where the memory usage can grow is on line 773 of
> pagelist.php. This line adds the processed page name to an array, so that
> PmWiki knows that the page was already processed. If you have a huge number
> of pages, the characters composing the page names alone may go over the
> memory limit. If your error messages mention this line 773, the problem is
> there.
>
> You can reduce the number of pages indexed (actually the number of seconds
> of continued indexing) by adding this in config.php:
>
>   $PageIndexTime = 5; # 5 seconds instead of 10
>
> I'll review the functions next weekend in case we are missing something.
>
> See also the recipe SystemLimits; you may be able to increase the memory
> limits.
>
>> A related question: as I'm using sqlite to store a large number of
>> short and very short pages, why use the PmWiki .pageindex process
>> rather than performing a full-text search?
>>
>
> The SQLite PageStore() class only allows the "storage" of the pages into a
> single SQLite database file. The reasons, the pros and cons are explained
> in the recipe page.
>
> Other than "a fulltext search from the SQLite database is not yet
> written", I think the built-in search using .pageindex will perform much
> faster than a fulltext database search.
>
> Petko
>
>
>



-- 

---------------------------------------
| A | de la langue française
| B | http://www.languefrancaise.net
| C | languefrancaise at gmail.com
---------------------------------------
       @bobmonamour
---------------------------------------

