
#1 2008-10-26 09:38:34

dav7
Member
From: Australia
Registered: 2008-02-08
Posts: 674

Idea to help whoever wants to have a crack at writing a new Arch forum

WOW, the title fit. :D

Okay, so I had an idea earlier for whoever wants to possibly write a new Arch forum: test data.

The current Arch forum is full of Linux-related posts, links to screenshots, code snippets and other data, and would make an excellent test case for a new forum, so I submit this idea to the forum administrators:

Would it be possible to export a subset (5k posts, perhaps) of the entire Arch forum, from somewhere in the middle between when it started (01) and where it is now (08)? I suggest the middle because that's when the most controversial posts would likely have been posted, as I'm guessing Arch was just settling in around then and the project was still being steered here and there a bit. A 2nd dump of a few K posts from 07-08 would be cool too, since that's going to feature more screenshots and user-related code dumps.

Of course, all personally identifiable information, such as passwords and IP addresses, would be anonymized. Passwords might be set to a static value, so that logging in as any given user would be easy (for testing purposes, of course). IPs might be converted to something between 256.256.256.256 and 999.999.999.999, but the replacement wouldn't be re-randomized each time an address was rewritten: for example, 12.34.56.78 might always become 258.334.555.987 and 1.2.3.4 might always become 435.764.567.429, with each "new" IP used every time the old one was encountered. That way the data would still separate one person from another and could be used for monetization purposes.
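
A minimal sketch of the stable mapping described above, in Python. It is only an illustration under assumptions that are not in the original post: instead of storing an old/new table it derives each fake address from a keyed hash (HMAC-SHA256), and the names `SECRET_KEY`, `pseudonymize_ip` and `scrub` are hypothetical. The same input always yields the same output, and every field lands between 256 and 999, so a scrubbed address cannot be mistaken for a real one.

```python
import hashlib
import hmac
import re

# Hypothetical secret: keep it out of the published dump so nobody can
# recompute the mapping from a guessed address.
SECRET_KEY = b"change-me"

# Loose IPv4 matcher: four dot-separated groups of 1-3 digits.
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def pseudonymize_ip(ip: str, key: bytes = SECRET_KEY) -> str:
    """Map an IPv4 string to a stable fake address with fields in 256..999."""
    digest = hmac.new(key, ip.encode("ascii"), hashlib.sha256).digest()
    fields = []
    for i in range(4):
        # Two digest bytes per field, folded into the 744 values 256..999.
        value = int.from_bytes(digest[2 * i:2 * i + 2], "big")
        fields.append(str(256 + value % 744))
    return ".".join(fields)


def scrub(text: str, key: bytes = SECRET_KEY) -> str:
    """Replace every IPv4-looking token in text with its pseudonym."""
    return IPV4_RE.sub(lambda m: pseudonymize_ip(m.group(0), key), text)
```

Running `scrub` over a post or a log line replaces 12.34.56.78 with the same fake address every time it appears, as long as the key stays the same, so one poster can still be told apart from another.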

That done, it doesn't matter what format this data might come in; I assume one or more MySQL tables, which is fine. This site obviously runs PunBB, so any enterprising developers could look up how PunBB stores its data or just browse around the tables themselves to find out how the data is arranged.

Now, to answer the most likely thing you're thinking right now: "Why?"

Some people work a lot better when presented with a set of data they need to operate on. Plus, software that is working with a heap of "real" data from the start, instead of just a few test posts, generally feels more "complete". And since this is a Linux-related forum, the chances of a replacement forum system being very Linux-centric are high, so posts that lean that way would be a real plus here.

I can't really answer any better than that; if anyone agrees with me on this, feel free to share thoughts, opinions, etc.

-dav7

Last edited by dav7 (2008-10-26 09:41:13)


Windows was made for looking at success from a distance through a wall of oversimplicity. Linux removes the wall, so you can just walk up to success and make it your own.
--
Reinventing the wheel is fun. You get to redefine pi.


#2 2008-10-26 13:43:05

Dusty
Schwag Merchant
From: Medicine Hat, Alberta, Canada
Registered: 2004-01-18
Posts: 5,986

Re: Idea to help whoever wants to have a crack at writing a new Arch forum

I'm planning to rewrite the forum but I'm very busy right now. Especially too busy to read such a long post.


#3 2008-10-26 14:51:48

wizzomafizzo
Member
From: Australia
Registered: 2005-12-05
Posts: 53

Re: Idea to help whoever wants to have a crack at writing a new Arch forum

He wants a database dump of the forums. It would have taken you less time to read his post than it did to type out a pompous dick answer, congratulations.


#4 2008-10-26 14:57:05

catwell
Member
From: Bretagne, France
Registered: 2008-02-20
Posts: 207

Re: Idea to help whoever wants to have a crack at writing a new Arch forum

<offtopic>

A while ago I had to randomize the IPs in Apache log files the way you (dav7) described, and I used this (it changes only the last two fields, but that's easy to adapt if you want to change all of them):

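#!/usr/bin/env perl
# ipscr.pl: scramble the last two fields of every IPv4 address read on STDIN.

use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Keep the first two fields; derive the last two from the MD5 of the full
# address, so the same input always maps to the same output.
sub scr
{
  my $ip = shift;
  my $h = md5_hex($ip);
  return $1.hex(substr($h,10,2)).'.'.hex(substr($h,20,2)) if ($ip =~ m/^(\d{1,3}\.\d{1,3}\.)/);
  return $ip;  # not an address after all: leave it untouched
}

while (<STDIN>)
{
  s/(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})/scr($1)/ge;
  print;
}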

like this:

for i in $(grep '\.gz$' loglist.txt); do echo "$i"; out="scr/${i%.gz}"; zcat "$i" | ./ipscr.pl > "$out"; gzip "$out"; done

It has the advantage of keeping a real IP format (fields between 0 and 255) and of not needing an old/new IP table (I used it on ~1GB of logs...). Of course two "old" IPs can be associated with the same "new" IP, but the odds are low enough.
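
To put a rough number on "low enough", here is a small Python sketch of the birthday-style estimate. It assumes the two replaced fields behave like uniform random bytes, so addresses sharing the same first two fields are spread over 256 × 256 = 65,536 possible outputs; the function name `collision_probability` is made up for the illustration.

```python
def collision_probability(n: int, space: int = 256 * 256) -> float:
    """Chance that at least two of n distinct inputs share an output,
    assuming outputs are uniform over `space` values."""
    if n > space:
        return 1.0  # more inputs than outputs: a collision is unavoidable
    p_unique = 1.0
    for k in range(n):
        # The (k+1)-th input must avoid the k outputs already taken.
        p_unique *= (space - k) / space
    return 1.0 - p_unique
```

Under that assumption a given pair of addresses with the same prefix collides with probability 1/65,536, but once a few hundred distinct addresses share one prefix, at least one collision becomes more likely than not.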

</offtopic>

Now back on topic: for the forum, I'd want something like Vanilla and its tags-based plug-in (see slicehost.net for an example), but let's just wait for Dusty's Django-powered bomb :)

