Module talk:R:Bailly

From Wiktionary, the free dictionary
Latest comment: 10 months ago by Mahagaja in topic Links are gone

Search

@Isomorphyc: Now that it's ordered, we might want to implement a binary search. —JohnC5 05:21, 2 July 2016 (UTC)
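A minimal sketch of the kind of binary search being proposed, assuming the headwords are held in a sorted array of {headword, page} pairs. The table contents and the name `b_search` as used here are illustrative assumptions, not the module's actual code, and this version is iterative rather than recursive.

```lua
-- Hypothetical data: an array sorted by headword (byte order).
local headwords = {
    { "agathos", 3 },
    { "logos", 1200 },
    { "philos", 2070 },
    { "theos", 920 },
}

-- Returns the index of `word` in the sorted array `list`, or nil if absent.
local function b_search(list, word)
    local lo, hi = 1, #list
    while lo <= hi do
        local mid = math.floor((lo + hi) / 2)
        local entry = list[mid][1]
        if entry == word then
            return mid
        elseif entry < word then
            lo = mid + 1   -- target lies in the upper half
        else
            hi = mid - 1   -- target lies in the lower half
        end
    end
    return nil
end
```

Each lookup inspects O(log n) entries, but the comparison relies on the array really being sorted in the same order that Lua's `<` uses for strings.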

@JohnC5: I still don't quite get Lua data structures. If this headwords structure is a set of nested arrays, it ought to be faster than the original version, but if it is nested hash tables, it ought to be slower, because the >O(n) per page hash table construction is amortised over one O(log n) binary search. Nevertheless, this version looks the best. Isomorphyc (talk) 12:23, 2 July 2016 (UTC)
@Isomorphyc: That is an interesting question. According to the Implementation docs for Lua 5.0:

“Unlike other scripting languages, Lua does not offer an array type. Instead, Lua programmers use regular tables with integer indices to implement arrays. Lua 5.0 uses a new algorithm that detects whether tables are being used as arrays and automatically stores the values associated to numeric indices in an actual array, instead of adding them to the hash table. This algorithm is discussed in Section 4.”

In theory, these should be array-like and not hash tables. —JohnC5 17:41, 2 July 2016 (UTC)
This is very helpful, thank you. I was reading the unofficial FAQ at the same time. So headwords is a multidimensional jagged array, probably with three heap allocations per word, one for the string, one for the array of page numbers, and one for the pair of pointers. I have flattened it one level, since I superfluously included adjacent page numbers. It is good to learn this on a non-demanding module; I should not have been treating the table structure so indifferently as I have been. Isomorphyc (talk) 18:38, 2 July 2016 (UTC)
@Isomorphyc: Thanks for all the work you have done. I have an unrelated topic which may interest you: the Chinese languages use a lengthy module mod:zh, which has an even lengthier data module mod:zh/data. This latter module, as you might imagine, has been causing memory overflows recently and could do with a good sharding. Unfortunately the data module gets updated somewhat frequently, unlike our AG reference lookup data. Is there a way to shard the data in mod:zh/data while keeping mod:zh/data itself for editing purposes, and have the shards update whenever the main module is updated? I know this update would be very expensive, but it would be far less expensive than loading the whole dataset for every single page that only uses a tiny portion of the data. It may also be the case that we can break up mod:zh/data into several sub-datasets, and only call those which are necessary for a given task. I thought, though, that your mod:data tables might be another viable solution. I'm also going to ping Wyang, Justinrleung, and Suzukaze-c as the primary arbiters of mod:zh. —JohnC5 19:17, 2 July 2016 (UTC)
@JohnC5: This is very feasible, and I would be glad to help. I wouldn't want to reshard for every addition unless wholly necessary, [edit: because of my understanding of how Wiktionary compresses revisions]. Instead, in addition to the shard set, we have an Unsharded file for new data which takes precedence. Hence, it can be used for insert, update, and delete (by setting to nil), and can almost be used as the sole user-facing file, in conjunction with a module-generated table of what already exists. Since it seems new characters are added to mod:zh/data at a rate of no more than a few hundred a year, there's not really any necessity to reshard more than annually, if that. If the 'unsharded' system is not desirable, it is possible, but messy, to use a robot to keep a user-facing file consistent with the shard set. The downsides to this method are latency and concurrency hazards. I would be glad to make a small prototype if the Chinese editors would like. The sharding code is the last link at User:OrphicBot, and the modules we use are located at Module:data tables and subpages named such as Module:data tables/dataC0, with increasing numerals to 99. I regret the present implementation is messy; it is a first version written in the last few days to solve a specific problem, and not really with a mind to being read. Features we didn't use are only stubs or sketches. Isomorphyc (talk) 22:05, 2 July 2016 (UTC)
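A rough sketch of how an override file could take precedence over the shard set. All names and data here are hypothetical. One assumption differs from the wording above: a plain Lua table cannot store nil as a value (assigning nil removes the key, so it looks the same as a key that was never present), so this sketch marks deletions with a sentinel value instead.

```lua
-- Unique sentinel (hypothetical) marking a deleted key; compared by identity.
local DELETED = {}

-- Stand-in for the user-facing Unsharded file.
local unsharded = {
    ["new"] = "added after the last reshard",
    ["changed"] = "updated value",
    ["gone"] = DELETED,
}

-- Stand-in for the data held in the shard pages.
local sharded = {
    ["changed"] = "stale value",
    ["gone"] = "to be hidden",
    ["old"] = "untouched value",
}

-- The override is consulted first; the shards are used only as a fallback.
local function lookup(key)
    local override = unsharded[key]
    if override == DELETED then
        return nil
    elseif override ~= nil then
        return override
    end
    return sharded[key]
end
```

With this layout an insert, update, or delete touches only the small override table, and a later reshard would fold the overrides back into the shard set.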
Module:zh/data needs splitting... I have split it into various subpages and made other modules call on it only when necessary. Memory requirement of the page has been reduced to about half of what it was before. I'm not sure I understand all of the above... the function b_search is returning the index associated with a word using a recursive halving algorithm - in the case of Module:zh/data, is the index to be used downstream in a separate data module to extract the relevant data? Is the benefit in memory from calling on a smaller data module? If so, the method in the function check_pron in Module:zh is perhaps helpful, which uses the codepoint of the first character to determine the subpage location and is used by {{zh-new}} in substitution. After splitting zh/data the subpages are much reduced in size; the ones which may benefit from further splitting are Module:zh/data/yue-pron (312,522 bytes) and maybe Module:zh/data/cmn-pron (188,978), although if the number of subpages is not too high there are other arguably simpler ways of splitting. Wyang (talk) 00:51, 3 July 2016 (UTC)
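To illustrate the general idea of choosing a subpage from the codepoint of the first character, here is a hypothetical sketch. It is not the actual logic of check_pron, which is not shown in this discussion: the bucket size (0x1000 codepoints per subpage) and the subpage name are invented for the example, and the decoder assumes well-formed UTF-8 input.

```lua
-- Decodes the first UTF-8 character of `s` into its codepoint.
-- Assumes `s` is well-formed UTF-8; returns nil for an empty string.
local function first_codepoint(s)
    local b1 = s:byte(1)
    if not b1 then return nil end
    if b1 < 0x80 then return b1 end          -- single-byte (ASCII)
    local extra, cp
    if b1 >= 0xF0 then
        extra, cp = 3, b1 - 0xF0             -- four-byte sequence
    elseif b1 >= 0xE0 then
        extra, cp = 2, b1 - 0xE0             -- three-byte sequence
    else
        extra, cp = 1, b1 - 0xC0             -- two-byte sequence
    end
    for i = 2, extra + 1 do
        cp = cp * 64 + (s:byte(i) - 0x80)    -- append 6 payload bits
    end
    return cp
end

-- Hypothetical naming scheme: one subpage per block of 0x1000 codepoints.
local function subpage_for(word)
    local bucket = math.floor(first_codepoint(word) / 0x1000)
    return "Module:zh/data/example/" .. string.format("%X", bucket)
end
```

Bucketing by codepoint range keeps neighbouring characters together, whereas a hash of the bytes spreads them evenly over the subpages.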
@Wyang: The changes you've made obviate the need for what we were describing. It was not very clear of us to contact you here inasmuch as this module doesn't use the process we were describing. A good example is the underlying data used by mod:R:Woodhouse through the mod:data tables interface. —JohnC5 01:57, 3 July 2016 (UTC)
Hi @JohnC5, hi @Wyang: For example, if we were mapping the character '夫' to pinyin, we would take the UTF-8 representation of '夫' = [229, 164, 171], and calculate sum(utf-8('夫')) modulo 100 = 564 % 100 = 64. Then we look in the 64th data file, which is something like [Module:data tables/dataC64], and load the exported table, which would look like this: data = { ..., ['cmn_simplified_to_pinyin'] = { ..., ['夫'] = 'fū', ... }, ... }. Then we could just access data['cmn_simplified_to_pinyin']['夫'], which gives 'fū', for our result. This loads only about 1% of the data. I am, to be honest, surprised you would have memory problems with a 500 KB file. Would it be possible for me to see an example of a page which has (or used to have) particularly bad memory performance? Thanks. Isomorphyc (talk) 02:45, 3 July 2016 (UTC)
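The shard selection described above can be sketched as follows. The function names are assumptions made for illustration; only the byte-sum-modulo-100 rule and the dataC page naming come from the discussion. On the wiki the resulting page would presumably be loaded with something like mw.loadData, which is left out here so the sketch runs standalone.

```lua
local NUM_SHARDS = 100

-- Sums the bytes of the (UTF-8 encoded) key and reduces modulo the shard count.
local function shard_index(key)
    local sum = 0
    for i = 1, #key do
        sum = sum + key:byte(i)
    end
    return sum % NUM_SHARDS
end

-- Builds the name of the data page expected to hold `key`.
local function shard_name(key)
    return "Module:data tables/dataC" .. shard_index(key)
end

-- '夫' is the byte sequence 229, 164, 171 in UTF-8: 564 % 100 = 64.
print(shard_name("\229\164\171"))
```

Because the index depends only on the key, a reader never needs a directory of which shard holds which entry.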
Thanks for the explanations. The articles previously having memory issues were and . Lots of things have been tried in the meantime, such as splitting Module:zh/data/ltc-pron and Module:zh/data, refining Module:zh-pron, and the memory usage of these articles is much improved. Wyang (talk) 09:26, 3 July 2016 (UTC)
Hi @Wyang: It seems that your recent changes reduced the Chinese section's memory usage from 35 MB to 18 MB for these pages, while Translingual, Japanese, and Korean also use about 6 MB each. The Lua memory usage is slightly non-deterministic, it appears. The first line of Module:zh-pron is: local m_zh = require("Module:zh"). So far as I can tell, this variable is never used in the module, and the line can be removed, which would save a further 5 MB from most Chinese pages. It is possible I misunderstand the semantics of require(), but I tested it on a few pages and there do not seem to be any adverse effects. There might be other ways to reduce the memory use of these modules without sharding, and your current solution is better than sharding, I believe. Please let me know if you still have problems or this becomes an issue again; hopefully there are some other ways to save some memory as well. Isomorphyc (talk) 14:24, 3 July 2016 (UTC)
Thanks, I will let you know if memory problems arise in the future. I removed the line in zh-pron; I agree - it does not seem to be used in the module. Wyang (talk) 22:52, 3 July 2016 (UTC)

Sometimes faulty

I'm sometimes sent to seemingly random pages. --Barytonesis (talk) 00:58, 27 June 2017 (UTC)

Links are gone

@Benwing2: your most recent edit appears to have removed the links to the dictionary. —Mahāgaja · talk 18:54, 19 July 2023 (UTC)