Archiving TMF boards

Formerly "Lemon Fool - Improve the Recipe" repurposed as Room 102 (see above).
melonfool
Lemon Quarter
Posts: 2939
Joined: November 4th, 2016, 11:18 am
Has thanked: 1365 times
Been thanked: 794 times

Re: Archiving TMF boards

#5289

Postby melonfool » November 15th, 2016, 4:10 pm

TawnyOwl wrote:I cannot see any reason why saving the old TMF boards is worthwhile. Basically nobody reads old posts, it's a waste of time and effort.

Tawny


Hmm, that isn't true for me. I have a number of old posts bookmarked and go back to them. I often remember old posts and find them for people. I've sometimes written a long post in explanation of something which gets asked again and I can just link back to the first time it was answered instead of doing it all again.

I suspect the issue with archiving the boards would not be about how to do it or where to put them, but how to make them useful/searchable etc.

But it is beyond me to make any suggestions on this - too technical for me.

Mel

Lootman
The full Lemon
Posts: 19361
Joined: November 4th, 2016, 3:58 pm
Has thanked: 657 times
Been thanked: 6915 times

Re: Archiving TMF boards

#5299

Postby Lootman » November 15th, 2016, 4:29 pm

mc2fool wrote:That may be true for some boards but for others, e.g. the Pensions - Practical Problems board, folks regularly refer -- and link -- back to previous detailed explanations of complex matters and their discussions, so I must disagree with the assertion that "nobody" reads old posts and it's a waste of time and effort.

I do not personally have much use for old commentary but certainly do not object to it being retained if there is a demand for it, and the provision of it does not interfere with adding new functionality here. I see it as a priority but not the highest by any means.

The danger of referring to an old post is that it may no longer be correct due to changes in the law, or because events have overtaken it, or simply because the commentator knew less back then than he or she knows now. Taking your example of pensions, for instance, that strikes me as an area where the government is constantly meddling. If I have a question about a 2017 pension, I probably want a 2017 answer rather than trust a 2012 answer.

melonfool
Lemon Quarter
Posts: 2939
Joined: November 4th, 2016, 11:18 am
Has thanked: 1365 times
Been thanked: 794 times

Re: Archiving TMF boards

#5310

Postby melonfool » November 15th, 2016, 4:49 pm

Lootman wrote:
mc2fool wrote:That may be true for some boards but for others, e.g. the Pensions - Practical Problems board, folks regularly refer -- and link -- back to previous detailed explanations of complex matters and their discussions, so I must disagree with the assertion that "nobody" reads old posts and it's a waste of time and effort.

I do not personally have much use for old commentary but certainly do not object to it being retained if there is a demand for it, and the provision of it does not interfere with adding new functionality here. I see it as a priority but not the highest by any means.

The danger of referring to an old post is that it may no longer be correct due to changes in the law, or because events have overtaken it, or simply because the commentator knew less back then than he or she knows now. Taking your example of pensions, for instance, that strikes me as an area where the government is constantly meddling. If I have a question about a 2017 pension, I probably want a 2017 answer rather than trust a 2012 answer.


Ah, but what if you are going through all your years of pension to update your files and do a full financial review and you want to know what the rules were in 2012....?

:)

Mel

stooz
Site Admin
Posts: 1468
Joined: November 3rd, 2016, 11:03 pm
Has thanked: 10 times
Been thanked: 502 times

Re: Archiving TMF boards

#5412

Postby stooz » November 15th, 2016, 9:13 pm

A real benefit would be parts about directors' dealings in shares, for example: an historical trail as a guide to intended investments.

However, we have been in discussions, lengthy and time-consuming. We are looking at 40 GB of data, many hours of conversion work, ongoing costs, continued legal protection costs and overall a bill over 5 figures... So don't get your hopes up.

hermit100
Lemon Pip
Posts: 79
Joined: November 4th, 2016, 11:35 am
Has thanked: 14 times
Been thanked: 2 times

Re: Archiving TMF boards

#5618

Postby hermit100 » November 16th, 2016, 1:36 pm

TawnyOwl wrote:I cannot see any reason why saving the old TMF boards is worthwhile. Basically nobody reads old posts, it's a waste of time and effort.


I often read old posts. Lots of really useful information there.

Gengulphus
Lemon Quarter
Posts: 4255
Joined: November 4th, 2016, 1:17 am
Been thanked: 2631 times

Re: Archiving TMF boards

#5665

Postby Gengulphus » November 16th, 2016, 3:13 pm

hermit100 wrote:I often read old posts. Lots of really useful information there.


Same here - and there's humour, glimpses into how I really thought about things in the past rather than how they've subsequently been fitted into my memory, and various other things as well. Enough stuff that I've saved hundreds of posts to my laptop or the WayBack Machine so far, concentrating on ones I need my user account to locate, and expect to save thousands more in the coming few months.

Having said that, thousands of posts is well under 0.1% of all the posts on TMF, and I'll probably never miss almost all of the rest...

Gengulphus

seekingbalance
2 Lemon pips
Posts: 163
Joined: November 7th, 2016, 11:14 am
Has thanked: 16 times
Been thanked: 66 times

Re: Archiving TMF boards

#5892

Postby seekingbalance » November 17th, 2016, 12:05 pm

As previously mentioned, I have been in touch with the Wayback machine archive team to see whether they can crawl the entire site.

They have finally replied to say they are looking into it, but while looking for a solution have also asked whether it is possible to get a list of every single URL for every post and thread, for entire thread views.

I have advised that I doubt this is possible but would ask, and have asked TMFTarantula whether he knows or can find out.

Anyone else from TMF know the answer?

Tony

Clariman
Lemon Quarter
Posts: 3288
Joined: November 4th, 2016, 12:17 am
Has thanked: 3134 times
Been thanked: 1566 times

Re: Archiving TMF boards

#5896

Postby Clariman » November 17th, 2016, 12:12 pm

An update from a Lemonfool perspective. Although we have had some discussions with other Fools who wanted to team up with us and to add archiving of the old boards, Stooz and I have decided to focus on keeping these boards active and productive. If others create an archive we will quite happily link to it, but we have no current plans to create our own archive - either the two of us or with others.

I know this will be a disappointment to some, but we need to keep it simple and focus on the task in hand.

Regards
Clariman & Stooz

mc2fool
Lemon Half
Posts: 8086
Joined: November 4th, 2016, 11:24 am
Has thanked: 7 times
Been thanked: 3125 times

Re: Archiving TMF boards

#7688

Postby mc2fool » November 22nd, 2016, 12:06 pm

seekingbalance wrote:As previously mentioned, I have been in touch with the Wayback machine archive team to see whether they can crawl the entire site.

They have finally replied to say they are looking into it, but while looking for a solution have also asked whether it is possible to get a list of every single URL for every post and thread, for entire thread views.

I have advised that I doubt this is possible but would ask, and have asked TMFTarantula whether he knows or can find out.

Anyone else from TMF know the answer?

Tony

Well, of course it's possible, being a matter of someone writing a scraper-cum-crawler to extract them all, but it's a bit of a curious way to go about it as I'd have thought (?) their normal crawler would get them anyway (as it already has for sections of the TMF boards at various snapshots over time).

There have been various methods proposed of archiving the old TMF boards (as well as this topic) and, AFAIAA, no recent news from any of them, so I'm just wondering if they're each on hold thinking one of the others is running with the archiving ball....

spiderbill
Lemon Slice
Posts: 549
Joined: November 4th, 2016, 9:12 am
Has thanked: 159 times
Been thanked: 184 times

Re: Archiving TMF boards

#7828

Postby spiderbill » November 22nd, 2016, 4:52 pm

seekingbalance wrote:As previously mentioned, I have been in touch with the Wayback machine archive team to see whether they can crawl the entire site.

They have finally replied to say they are looking into it, but while looking for a solution have also asked whether it is possible to get a list of every single URL for every post and thread, for entire thread views.

I have advised that I doubt this is possible but would ask, and have asked TMFTarantula whether he knows or can find out.

Anyone else from TMF know the answer?

Tony


I'm running a spider across it at the moment to see what sort of task we're facing. (May take a while.) Once I have the results I should be able to sort them into patterns unless there's too much data for Excel to handle. A few years ago I could have easily built a database to handle it but now that the PC world is all Access I'm likely to lose the will to live - just can't stand the interface ;-) A web-based one in MySQL might be possible but we'll see.

seekingbalance
2 Lemon pips
Posts: 163
Joined: November 7th, 2016, 11:14 am
Has thanked: 16 times
Been thanked: 66 times

Re: Archiving TMF boards

#8309

Postby seekingbalance » November 23rd, 2016, 7:10 pm

mc2fool - yes, I agree. I said the same to the guy who answered the email, but he has not yet replied to that further mail. I said that they are clearly already doing some sort of crawl, albeit one that leaves out a lot of pages - there have been nearly 1000 crawls at the top page level - so surely it is possible to just initiate a full crawl, and surely they did not want a list with potentially a million or more URLs on it.

I also pointed out that they have claims on their site to have archives of a number of major sites, and asked how they got those done. But again, no reply as yet.

I'll give it a few more days, and go back again.


Spiderbill - great. It will be interesting to at least get a number. Doesn't Excel have a limit of around 1 million rows?

SB

spiderbill
Lemon Slice
Posts: 549
Joined: November 4th, 2016, 9:12 am
Has thanked: 159 times
Been thanked: 184 times

Re: Archiving TMF boards

#8398

Postby spiderbill » November 24th, 2016, 12:01 am

Well, unfortunately the spidering was very inconclusive - there were an awful lot of timeouts and an awful lot of "no such host" errors where the site just didn't respond. I've checked some of the examples of those errors and they work when going there manually but it seems the server can't handle large numbers of checks. Since I was only running about 28 threads that may be a clue as to why the Wayback Machine people were trying to get a listing - their bots may be getting similar timeouts and errors.

What I got was about 625,000 rows of data but of course many of the individual pages appear under different parameters depending on how you get to them and how they are displayed
e.g.
&sort=postdate
&sort=threaded
&sort=collapsed
&sort=username
which tends to reduce the number of unique posts.
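
As a rough illustration of that reduction, here is a minimal Python sketch that collapses crawl rows differing only in the `sort` parameter. It assumes that `sort` changes only how a page is displayed and never which post is shown; the function names `canonical_url` and `unique_pages` are made up for this example.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Query parameters assumed to affect display only (taken from the list above).
DISPLAY_ONLY_PARAMS = {"sort"}


def canonical_url(url: str) -> str:
    """Return the URL with display-only parameters and any #fragment removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in DISPLAY_ONLY_PARAMS
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def unique_pages(urls):
    """Collapse crawl rows that differ only by display parameters."""
    return sorted({canonical_url(u) for u in urls})
```

Running `unique_pages` over the spider output would give a closer estimate of how many distinct pages were actually reached, though it would still count a post reached via two different URL styles twice.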

So it's clear from this that the spidering hasn't reached a lot of the pages, which themselves have links to other pages which haven't been found yet. Even as it was I got down to the 23rd level of link depth, which shows you what a complex site it is. I suspect we may have to lower our sights and concentrate on the more important forums. (whichever they are)

mc2fool
Lemon Half
Posts: 8086
Joined: November 4th, 2016, 11:24 am
Has thanked: 7 times
Been thanked: 3125 times

Re: Archiving TMF boards

#8408

Postby mc2fool » November 24th, 2016, 1:22 am

spiderbill wrote:So it's clear from this that the spidering hasn't reached a lot of the pages, which themselves have links to other pages which haven't been found yet. Even as it was I got down to the 23rd level of link depth, which shows you what a complex site it is. I suspect we may have to lower our sights and concentrate on the more important forums. (whichever they are)

I think exhaustive spidering isn't needed and you don't need to go down any level of links. Just get the list of posts in a board and "touch" each post to get its textual url (as well as the ?mid= one). Repeat for "Whole thread" urls.

PennyUK
Posts: 17
Joined: November 5th, 2016, 10:37 am
Has thanked: 2 times

Re: Archiving TMF boards

#8454

Postby PennyUK » November 24th, 2016, 9:16 am

Surely it is quite simple to get a list of all posts?

We know that posts can be accessed via URLs of the form http://boards.fool.co.uk/Message.aspx?mid=13419738
The highest possible 'mid' number has been posted here - that is the last ever message posted. We could trawl back to find the oldest message, ie the one with the lowest mid number.

We know that non-existent messages return the same message
eg http://boards.fool.co.uk/Message.aspx?mid=13496345

FWIW, the 'format for printing option' possibly returns the smallest size of message for checking: http://boards.fool.co.uk/MessagePrint.aspx?mid=12496345
Unfortunately formatting for printing does not have all the context (or even the number of recs). But the URL does not change when the message is returned.

So to get a list of messages, a program just loops through all the possible message numbers, and records which ones return a valid message.

Of course, that leaves the second half of the archive problem, which is to actually get the contents....
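
The loop described above might look something like the following Python sketch. It is only an outline under stated assumptions: the `fetch` callable and the `not_found_marker` text are placeholders supplied by the caller (the exact wording of the "not found" page is not given in this thread), and the print-format URL is used because it is suggested above as probably the smallest response.

```python
from typing import Callable, Iterator, List, Tuple

# Print-format URL quoted in the post above.
PRINT_URL = "http://boards.fool.co.uk/MessagePrint.aspx?mid={mid}"


def candidate_urls(first_mid: int, last_mid: int) -> Iterator[Tuple[int, str]]:
    """Yield (mid, url) for every message number in the inclusive range."""
    for mid in range(first_mid, last_mid + 1):
        yield mid, PRINT_URL.format(mid=mid)


def existing_mids(
    first_mid: int,
    last_mid: int,
    fetch: Callable[[str], str],   # hypothetical: returns the page body for a URL
    not_found_marker: str,         # hypothetical: text that only the "not found" page contains
) -> List[int]:
    """Return the message numbers whose page does not look like the 'not found' page."""
    found = []
    for mid, url in candidate_urls(first_mid, last_mid):
        body = fetch(url)
        if not_found_marker not in body:
            found.append(mid)
    return found
```

Passing `fetch` in as a parameter keeps the loop testable without touching the site. Any real fetcher would presumably need to be throttled, given the timeouts reported earlier in this thread when the server was hit from about 28 threads.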

mc2fool
Lemon Half
Posts: 8086
Joined: November 4th, 2016, 11:24 am
Has thanked: 7 times
Been thanked: 3125 times

Re: Archiving TMF boards

#8476

Postby mc2fool » November 24th, 2016, 9:53 am

PennyUK wrote:We could trawl back to find the oldest message, ie the one with the lowest mid number.

This one? http://boards.fool.co.uk/Message.asp?mid=5667966

modellingman
Lemon Slice
Posts: 638
Joined: November 4th, 2016, 3:46 pm
Has thanked: 625 times
Been thanked: 377 times

Re: Archiving TMF boards

#8860

Postby modellingman » November 25th, 2016, 9:49 am

This thread has taken an interesting turn, so here's my two pennorth.

There's a hierarchy: Boards, Threads and Posts.

Each Board has a board-id. I've updated the list of the Boards and board-ids I previously posted to include all the Company boards. It is here (*).

Posts also have a unique message-id

A URL of the form

(1) boards.fool.co.uk/board-id.aspx?mid=message-id

provides the content of a Board as a "page" containing a list of, generally, 50 Posts in the form of links. The list starts with the Post which has the specified message-id.

The URLs sitting behind the "Prev" and "Next" links at the top of each list (and repeated at the foot of the list) provide message-id values which can be used to generate the preceding and succeeding "page" of that Board's content.

If there is a conflict between the board-id and the message-id in (1), ie the specified Post does not belong to the specified Board, the board-id takes precedence and the list starts with the first Post with the next highest message-id. A useful trick is to set the message-id to 1. This lists the earliest content of the specified Board. The most recent content is accessed by omitting the querystring from URL (1). So the content of a board can be listed starting with the oldest and going forwards in time, or with the most recent and going backwards.

If 'sort=collapsed' is added to the querystring of (1), the returned list shows the first Posts in a series of Threads - it is, in effect, a list of Threads. The message-id of the first Post in a Thread can be used as a thread-id for the Thread, so a means of generating all the thread-ids is available for a given board-id.

Generating a list of just the message-ids associated with each thread-id is possible using 'sort=expanded' but looks a bit more messy to do.
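
For illustration, URL form (1) and its variations can be built with a small helper like the one below. This is an untested sketch: `board_page_url` is a made-up name, and `board_id` is treated as an opaque string because the thread does not confirm whether the bare numeric id is accepted without the descriptive text that appears in real board URLs.

```python
from typing import Optional

BASE = "http://boards.fool.co.uk"


def board_page_url(board_id: str, mid: Optional[int] = None,
                   sort: Optional[str] = None) -> str:
    """Build a board listing URL of form (1).

    mid=None  -> most recent content (querystring omitted, as described above)
    mid=1     -> earliest content of the board
    sort      -> e.g. "collapsed" for a list of Threads
    """
    params = []
    if mid is not None:
        params.append(f"mid={mid}")
    if sort:
        params.append(f"sort={sort}")
    query = "?" + "&".join(params) if params else ""
    return f"{BASE}/{board_id}.aspx{query}"
```

Walking a board would then mean starting from `board_page_url(board_id, mid=1, sort="collapsed")` and repeatedly following the message-id behind the "Next" link until the list runs out.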

Sticking with thread-ids, a URL of the form

(2) boards.fool.co.uk/thread-id.aspx?sort=whole

lists a Thread in its entirety. The authored content of each Post in the Thread is contained within a <blockquote> element which contains any <b>...</b>, <i>...</i>, <pre>...</pre> tags and any links which the author inserted. Useful. Analysis of the served HTML is needed to identify how to pluck out the other useful stuff (such as Date/Time submitted, Author and No.of Recs) and associate it with each Post's content. No doubt some have already done this analysis and implemented it.
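
A minimal sketch of the content-plucking step, using only Python's standard `html.parser`, is shown below. It assumes, as described above, that each Post's authored content sits in its own top-level `<blockquote>`; the class and function names are invented for this example, and it has not been run against the real pages.

```python
from html.parser import HTMLParser


class BlockquoteExtractor(HTMLParser):
    """Collect the inner markup of each top-level <blockquote> element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.posts = []     # one string per top-level blockquote
        self._depth = 0     # nesting depth of blockquotes we are inside
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "blockquote":
            self._depth += 1
            if self._depth == 1:
                self._buf = []      # start of a new post body
                return
        if self._depth:
            self._buf.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are kept verbatim.
        if self._depth:
            self._buf.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "blockquote" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.posts.append("".join(self._buf).strip())
                return
        if self._depth:
            self._buf.append(f"</{tag}>")

    def handle_data(self, data):
        if self._depth:
            self._buf.append(data)


def extract_posts(html: str):
    """Return the inner markup of every top-level blockquote in the page."""
    parser = BlockquoteExtractor()
    parser.feed(html)
    parser.close()
    return parser.posts
```

Inline tags such as `<b>`, `<i>` and `<pre>` are preserved, while character references are decoded to plain text. Author, date and rec counts would still need the separate analysis of the served HTML mentioned above.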

So, there you have it, a theoretical and untested view of how to suck out lists of Post/Thread URLs and/or content from boards.fool.co.uk.

But wait, hasn't this been done already? See viewtopic.php?f=21&t=197#p3030

modellingman
Lemon Slice
Posts: 638
Joined: November 4th, 2016, 3:46 pm
Has thanked: 625 times
Been thanked: 377 times

Re: Archiving TMF boards

#8874

Postby modellingman » November 25th, 2016, 10:19 am

Each Board has a board-id. I've updated the list of the Boards and board-ids I previously posted to include all the Company boards. It is here (*).


Bad form to reply to my own post but I missed off:

(*) The list of Boards includes "Last Post" and "#Posts" columns. The data was compiled before the Fool closed its boards, so it is not the final position. For the purposes of identifying long unused or little used boards it is good enough. It was compiled using manual copy and paste from Fool webpages rather than using a script. There may be the odd board or two omitted from the list in error.

Gromley
Posts: 33
Joined: November 4th, 2016, 5:53 pm
Has thanked: 3 times
Been thanked: 1 time

Re: Archiving TMF boards

#11014

Postby Gromley » December 1st, 2016, 7:52 pm

I thought I read somewhere on another thread that stooz had already grabbed the data of all of the posts?

If so, then the "urgent" part of supporting the option of hosting the historical posts on an alternative site (i.e. the part needed before Feb) has already been done.

It would be useful to clarify this, because if not I'm happy to dust off my previous scraper and look at grabbing the data.

But also if this is so then it may be that stooz's data is an easier way to address the list of urls that the time machine would be looking for.

Time machine is potentially the best option I believe as it takes out all of the hassle of hosting, writing new code and any potential legal / ownership issues.

It's fairly easy to create a list of urls that would allow access to each post in either threaded or non-threaded mode.

As has been said earlier, each post can be accessed in the format: http://boards.fool.co.uk/Message.asp?mid=5667966

Each thread can be accessed in the format http://boards.fool.co.uk/Message.asp?mid=5667966&sort=whole#5667966

However I presume time machine would need to recreate all of the (relevant) links involved in interaction with the site and this is far more problematical I think. For example in the first link above, the actual address that the TMF site uses to access the thread is http://boards.fool.co.uk/greetings-fool-5667966.aspx?sort=threaded and it is a different url if you access the threaded view from a post later in the thread ie http://boards.fool.co.uk/Message.aspx?mid=5667966&sort=whole#11641184 where the second number is the 'mid' of the message from which you accessed the threaded view.

Also worth noting that if you access the list of posts on a given board, the url is http://boards.fool.co.uk/ask-a-foolish-question-50000.aspx?mid=11641184 where the number is the 'mid' of the post that will be top of the list, and then don't forget you can access this list view sorted by mid, user, title, recs and in threaded or unthreaded format.

The list of options is finite, but immense I think; and if the time machine would need to store the results of each individual url, I suspect it would be unmanageable.

I must confess I hadn't given much thought to how the time machine works previously, but having now done so - I suspect it would be unmanageable, but I'd be delighted to be proven wrong (or for someone to come up with a workable solution where not all of the url combinations worked but still allowed access to all the posts / boards / threads)

So we are back to having to scrape the data. If I'm wrong and stooz hasn't already done this, I don't in fact think it is unmanageable.

As noted above the first post is apparently : http://boards.fool.co.uk/Message.asp?mid=5667966 which is TMFSki's welcome to the boards post (although I seem to recall from when I ran my automated scraper that I found an anomalous earlier post apparently from before the boards were launched - but the loss of that would not be an issue I suspect).

The last post I could find from a manual scan just now was http://boards.fool.co.uk/MessagePrint.aspx?mid=13460917

posted by "bitstrange" and poetically has the title "So sad... " and content of : "I just can't believe how emotional I'm feeling... :-( "

amen to that!

So that is just shy of 7.8m posts (including any deleted ones - which will just return the quirky "post not found" message).
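
The arithmetic behind that figure, as a short Python check. The two message numbers are the ones quoted above; the request rate in `days_to_fetch` is purely a hypothetical parameter for estimating how long one pass might take.

```python
FIRST_MID = 5667966    # TMFSki's welcome post, quoted above
LAST_MID = 13460917    # last post found by manual scan, quoted above


def mid_span(first: int, last: int) -> int:
    """Number of candidate message ids in the inclusive range."""
    return last - first + 1


def days_to_fetch(count: int, requests_per_second: float) -> float:
    """Rough duration of one pass at a steady (hypothetical) request rate."""
    return count / requests_per_second / 86400


print(mid_span(FIRST_MID, LAST_MID))   # 7792952
```

That count is an upper bound on the number of posts, since deleted or never-used numbers fall inside the range too.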

A manageable number I would think.

Would be great to understand from stooz whether or not the scraping has actually been done and if anyone has any thoughts on how to make time machine feasible that would be even better.

Regards,

Gromley

Gromley
Posts: 33
Joined: November 4th, 2016, 5:53 pm
Has thanked: 3 times
Been thanked: 1 time

Re: Archiving TMF boards

#13490

Postby Gromley » December 8th, 2016, 9:15 pm

Not sure if interest in this topic has waned given the lack of responses, but I had a few more thoughts that I think might be useful.

I'll try to keep it short and snappy (although that is not my forte!)

There are two options that have been considered (although I'll come to a possible third at the end of this post); 1- creating & hosting a replica of the TMF boards & structure and 2- using the waybackmachine (I kept calling it the time-machine in my last post - doh) to provide a view of how the boards were at date X. I'll take these in turn.

1- creating & hosting a replica of the TMF boards & structure

I had missed stooz's post on this topic from 15-Nov :

stooz wrote:A real benefit would be parts about directors' dealings in shares, for example: an historical trail as a guide to intended investments.

However, we have been in discussions, lengthy and time-consuming. We are looking at 40 GB of data, many hours of conversion work, ongoing costs, continued legal protection costs and overall a bill over 5 figures... So don't get your hopes up.


So to cover the issues stooz raises one by one :
> We are looking at 40 GB of data - In fact with efficient data storage (not including text compression which might save more) it will actually be about 8 GB - which is probably more manageable
> many hours of conversion work - DONE!
> ongoing costs - Basically just the costs of hosting - not a massive financial amount I would think. As I don't know the future of my hosted sites at this stage it is not something I can offer currently.
> continued legal protection costs - Presumably this is by way of liability insurance in the case of any comment being called out as libel etc? Not sure how much this would be, but presumably would diminish over time as the posts age?
> and overall a bill over 5 figures - not sure what those costs are? The insurance and ongoing hosting would be the only ones actually required.

It seems to me that this option is highly feasible. I could personally deliver a complete website that mirrors the functionality of the TMF boards (including reinstating Best of Boards (with date filters) that is currently not available in the read-only version - plus maybe some of the other features that I recall have been on people's wish lists in the past).

However as I mentioned above, I'm not currently prepared to commit to hosting this on an ongoing basis, nor to taking on the "legal cover"; but if someone else were prepared to lead on this and has a hosting package that includes PHP & MySQL - I'd be happy to deliver the data and the techy stuff.

{Favourite Fools & personal recommendations would not be included and they are not available now, but potentially a new solution could allow users on the new platform to select their favourite Fools and posts to aid navigation}

2- using the waybackmachine

I had never previously considered how this works; but I think now that it works on "static" webpages and records the returned page content for each of its list of URLs.

This is then problematic for a data driven site like the fool as there are multiple URLs that can result in the same content being presented, so each page view would have to be stored multiple times and I am pretty sure that this would be seriously beyond the scope of what they do.

I could write reams on this issue, but I think one issue will suffice to illustrate the problem :

When you look at a list of posts on a board the url is (for eg) :

http://boards.fool.co.uk/investment-strategies-50090.aspx?mid=13454393&sort=postdate

Where 50090 is the id of the board, 13454393 is the post number that will appear at the top of the list and postdate is the sort order.

So for each board it would be required to store at least the number of posts x the number of sort options as individual archived web-pages.

Then there is threaded vs unthreaded & a number of additional options besides - it's definitely a finite number, but I can't begin to imagine how big.

BUT the point has been made that you can address each individual post on the boards by using the format: http://boards.fool.co.uk/Message.aspx?mid=13419738 where 13419738 can be any integer in the range 5667966 to 13460917.

That's c. 7.8m individual pages to ask the waybackmachine to archive - I guess that would be manageable?

What I also found though is that the url above actually produces a redirect code to the "actual" address of the webpage in the format http://boards.fool.co.uk/i-am-not-a-fan ... 19738.aspx

where the text bit appears to be the first few words/characters of the post content and the number at the end is the message id as per the earlier format.
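
To show how the different URL shapes mentioned in this thread relate to one another, here is a small Python sketch that classifies a boards.fool.co.uk URL and pulls out the id it carries. The rules are inferred only from the example URLs quoted in this thread, so treat them as assumptions; `classify` is a made-up name.

```python
import re
from typing import Optional, Tuple
from urllib.parse import urlsplit, parse_qs

_MESSAGE_PATH = re.compile(r"^/message(print)?\.aspx?$", re.IGNORECASE)
_TRAILING_ID = re.compile(r"-(\d+)\.aspx?$", re.IGNORECASE)


def classify(url: str) -> Tuple[str, Optional[int]]:
    """Return ("post", mid), ("board", board_id) or ("unknown", None).

    Assumed rules, based on the examples in this thread:
      /Message.aspx?mid=N        -> post N
      /some-text-N.aspx          -> post N (the redirect target)
      /some-text-B.aspx?mid=N    -> listing of board B starting at post N
    """
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    if _MESSAGE_PATH.match(parts.path) and "mid" in query:
        return "post", int(query["mid"][0])
    match = _TRAILING_ID.search(parts.path)
    if match:
        kind = "board" if "mid" in query else "post"
        return kind, int(match.group(1))
    return "unknown", None
```

With something like this, a list of textual URLs could be reduced back to bare message ids, which makes it easier to check that each post is covered once regardless of which URL style was captured.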

If needed I could provide a list of the less than 7.8m urls in this format

And by capturing those all of the CONTENT of the TMF boards could be preserved in perpetuity - however this would not preserve the functionality. Clicking on any (or at least most) of the links on a given post would result in a URL that had not been preserved.

So this is where I come to the third possible option, inevitably :

3. A HYBRID

My first option above - the "mirror site" essentially has 2 components - (1) the post headers and all the clever (not too) code that deals with the navigation [About 1Gb of the data] and (2) the actual content of each post, flat text including html commands [7Gb of the data].

So IF the wayback machine were able to preserve the 7.8m (slightly less) individual posts, then a hosted navigation service could easily allow navigation around the site. (I could even translate any links embedded in the post contents).
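
A very rough Python sketch of what the navigation side of such a hybrid could do: turn stored post headers into links that point at archived copies. It assumes the Wayback Machine's usual `web.archive.org/web/<timestamp>/<original url>` address pattern and that a snapshot of each post actually exists; the header data, the default timestamp and the function names are all hypothetical.

```python
from typing import Iterable, List, Tuple

ORIGINAL = "http://boards.fool.co.uk/Message.aspx?mid={mid}"
# Assumed archive address pattern; a partial timestamp such as a year is
# expected to resolve to the nearest available snapshot.
WAYBACK = "https://web.archive.org/web/{timestamp}/{url}"


def archived_post_url(mid: int, timestamp: str = "2016") -> str:
    """Address of the archived copy of one post (if one was captured)."""
    return WAYBACK.format(timestamp=timestamp, url=ORIGINAL.format(mid=mid))


def board_index(headers: Iterable[Tuple[int, str]]) -> List[Tuple[str, str]]:
    """Turn (mid, title) post headers into (title, archived link) pairs."""
    return [(title, archived_post_url(mid)) for mid, title in sorted(headers)]
```

The hosted site would then only need to hold the headers (the roughly 1Gb of navigation data mentioned above) and could leave the post bodies with the archive.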

I wonder if this would also address the legal liability point? The hosted site would only be addressing content published by another provider - the waybackmachine and presumably they already have some mechanism protecting themselves from any archived comment that may be considered libelous?

Anyway I clearly failed on the brevity point, but hopefully this may have given a possible new direction to this?


Cheers,

Gromley

Clariman
Lemon Quarter
Posts: 3288
Joined: November 4th, 2016, 12:17 am
Has thanked: 3134 times
Been thanked: 1566 times

Re: Archiving TMF boards

#13503

Postby Clariman » December 8th, 2016, 10:02 pm

Hi Gromley. The technical aspects may be the easiest part of it. When Stooz and I looked at this with two other Fools, in discussion with TMF, the costs were into 5 figures if I recall. You address some of them but not the fact that TMF own the data and the copyright. To legitimately copy it there may be licensing and services costs involved, in addition to insurances.

Clariman

