ForumPostersUnion.com


   

  #1  
Old 12-29-2007, 11:44 AM
AnthonyCea AnthonyCea is offline
Publisher
 
Join Date: Feb 2006
Posts: 31,531
Twingbot/1.0 accoona.net

Accoona must have given their bot a new name, and it seems they have also purchased more crawling firepower, since this appears to be a new data center.

208.84.132.11 pfw002-g2-netb.si.it.accoona.net
Twingbot/1.0 (+http://www.twingbot.com/)
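The user agent above follows the common "Name/version (+info URL)" shape, so the bot name and its information page can be pulled out of a log line mechanically. The following is a minimal sketch, assuming that shape; the pattern and the function name `parse_bot_user_agent` are illustrative and not part of any Twingbot or forum software.

```python
import re

# Matches user agents shaped like "Name/1.0 (+http://example.com/)".
# This shape is an assumption based on the single string quoted in the post.
UA_PATTERN = re.compile(
    r"^(?P<name>[^/\s]+)/(?P<version>[\w.]+)\s*\(\+(?P<info_url>https?://[^)\s]+)\)$"
)


def parse_bot_user_agent(user_agent):
    """Return a dict with name, version and info_url, or None if the shape differs."""
    match = UA_PATTERN.match(user_agent.strip())
    return match.groupdict() if match else None


parsed = parse_bot_user_agent("Twingbot/1.0 (+http://www.twingbot.com/)")
browser = parse_bot_user_agent("Mozilla/5.0 (Windows NT 5.1)")
print(parsed)
print(browser)
```

A user agent is only a claim made by the client, so this identifies what a visitor says it is, not what it actually is.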
  #2  
Old 01-22-2008, 02:13 PM
AnthonyCea AnthonyCea is offline
Publisher
 
Join Date: Feb 2006
Posts: 31,531
Here is the other record we have on AccoonaBot
  #3  
Old 01-28-2008, 08:24 AM
ScottG ScottG is offline
Member
 
Join Date: Jan 2008
Posts: 15
Quote:
Originally Posted by AnthonyCea View Post
Accoona must have given their bot a new name, and it seems they have also purchased more crawling firepower, since this appears to be a new data center.

208.84.132.11 pfw002-g2-netb.si.it.accoona.net
Twingbot/1.0 (+http://www.twingbot.com/)
Howdy,

Nope. Accoona still has its own 'bot. (Though I don't work on Accoona, so I can't say much about it.) As for Twingbot, that's for a new product we're working on for a beta launch. It's designed to be well-behaved and robots.txt compliant. We've tested it extensively on a variety of platforms, so it shouldn't be messing anything up. If you have any issues with it, we'd really appreciate knowing. Twingbot has its own email at Twingbot - then put in the at symbol - plus twingbot and the dot com.

Scott Germaise
Director, Product Management
Twing
  #4  
Old 01-28-2008, 09:53 AM
AnthonyCea AnthonyCea is offline
Publisher
 
Join Date: Feb 2006
Posts: 31,531
Scott welcome to the forum.

Can you explain a bit about why your host name resolves to accoona.net?

Are you affiliated with them in any way, or is Twingbot one of their projects?
  #5  
Old 01-28-2008, 11:39 AM
ScottG ScottG is offline
Member
 
Join Date: Jan 2008
Posts: 15
Quote:
Originally Posted by AnthonyCea View Post
Scott welcome to the forum.

Can you explain a bit about why your host name resolves to accoona.net?

Are you affiliated with them in any way, or is Twingbot one of their projects?
Sure. Twing.com is a separate division of Accoona Corp. So we - perhaps obviously - use the company's data center assets. We may get around to changing the DNS for Twing.com at some point to avoid any confusion between the corporate divisions. But it isn't really a priority since, from a user's perspective, it's just another product.

The only folks who would really care about data centers or who owns the Twingbot would be those - such as yourself - who are more sophisticated and have some reason and the skills to check on who's doing what with a 'bot. What's most important, though, is that our 'bot behaves itself properly. And Twingbot has its own home page, www.twingbot.com, so anyone with an issue can easily get to us. Actually, he has his own MySpace page as well: www.myspace.com/twingbot .

The Twing.com product that Twingbot will be working on will be launching soon; though I'm sorry I can't share a specific date just yet.

Scott
  #6  
Old 01-28-2008, 09:54 PM
AnthonyCea AnthonyCea is offline
Publisher
 
Join Date: Feb 2006
Posts: 31,531
Scott, we always advise new search engines to publish a great spider information page, complete with the IPs they use to crawl the web. It is for their own good: that way webmasters will not ban them, and if you are not banned you have a much better opportunity to index a wide cross section of websites.

The more data you can give webmasters, the better off you will be, since there are so many hackers, content scrapers and spam bot nets spoofing search engine bots from blacklisted IPs.

We have banned the IPs of many content scrapers that spoofed GoogleBot and MSNBot. We were able to sort these out simply from knowing the IPs, but many webmasters simply ban a lot of bots since they don't want to bother. So create a great bot identification page and link to it in your user agent, and you will have a lot fewer problems.
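The sorting described above, telling a genuine crawler from a spoofed one by its IP, can be sketched with Python's standard `ipaddress` module. This is only an illustration: the table `PUBLISHED_CRAWLER_NETWORKS` is hypothetical, and the single /32 entry is the one address quoted in this thread, not a published Twingbot range.

```python
import ipaddress

# Hypothetical table of crawler networks a webmaster has collected from
# spider information pages. Only 208.84.132.11 comes from this thread.
PUBLISHED_CRAWLER_NETWORKS = {
    "Twingbot": [ipaddress.ip_network("208.84.132.11/32")],
}


def claimed_bot_ip_is_published(user_agent, remote_ip):
    """Return True or False when the user agent names a known bot,
    or None when it names no bot in the table."""
    address = ipaddress.ip_address(remote_ip)
    for bot_name, networks in PUBLISHED_CRAWLER_NETWORKS.items():
        if bot_name.lower() in user_agent.lower():
            return any(address in network for network in networks)
    return None


ua = "Twingbot/1.0 (+http://www.twingbot.com/)"
genuine = claimed_bot_ip_is_published(ua, "208.84.132.11")
spoofed = claimed_bot_ip_is_published(ua, "203.0.113.5")
unknown = claimed_bot_ip_is_published("Mozilla/5.0", "203.0.113.5")
print(genuine, spoofed, unknown)
```

A `False` result means the visitor claims a known bot name from an address the operator never published, which is the case worth investigating or banning.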
  #7  
Old 01-29-2008, 08:56 AM
ScottG ScottG is offline
Member
 
Join Date: Jan 2008
Posts: 15
Quote:
Originally Posted by AnthonyCea View Post
Scott, we always advise new search engines to publish a great spider information page, complete with the IPs they use to crawl the web. It is for their own good: that way webmasters will not ban them,
Good thought, thanks. I'm looking into getting the specific IPs or at least a range.

Scott
  #8  
Old 01-30-2008, 12:14 AM
Sykko
Guest
 
Posts: n/a
sorry to step in... I just now signed up and found this forum searching about this issue.

anyway I agree with Anthony here...

as it is I have been quite confused as to why this bot is constantly on my forum browsing around (mostly at open directories)

I had never heard of accoona or of twingbot... I wasn't incredibly worried because accoona.com led me to the search engine so I figured this was just a bot... but it's still acting a bit odd on my site so it would have been nice to know that it's a new product that you are working on. had the IP resolved to twingbot.com I would have visited that site instead of going to accoona...

but I found it so I'll be shooting an email shortly
  #9  
Old 01-30-2008, 12:19 AM
AnthonyCea AnthonyCea is offline
Publisher
 
Join Date: Feb 2006
Posts: 31,531
Sykko, welcome to the forum.

Giving webmasters complete information is essential for new search engines. Most do not do a great job of providing comprehensive data, including the IPs they use to spider the web. This is unfortunate for them, since many webmasters simply ban the bots they can't get data on quickly.
  #10  
Old 01-30-2008, 12:25 AM
Sykko
Guest
 
Posts: n/a
Quote:
Originally Posted by AnthonyCea View Post
Sykko, welcome to the forum.

Giving webmasters complete information is essential for new search engines. Most do not do a great job of providing comprehensive data, including the IPs they use to spider the web. This is unfortunate for them, since many webmasters simply ban the bots they can't get data on quickly.
yeah, my gut thought was to block the IP until I realized that it was a bot...

I'm still unsure. it just lingers there... but as far as I know it's not eating up too much bandwidth
  #11  
Old 01-30-2008, 12:29 AM
AnthonyCea AnthonyCea is offline
Publisher
 
Join Date: Feb 2006
Posts: 31,531
Most new engines will crawl heavily for about a week until they build an index. SearchMeBot was here steadily for about 2 weeks as well. I have not seen it for about a week or so now, so I guess they got what they needed for now.
  #12  
Old 01-30-2008, 12:33 AM
Sykko
Guest
 
Posts: n/a
Quote:
Originally Posted by AnthonyCea View Post
Most new engines will crawl heavily for about a week until they build an index. SearchMeBot was here steadily for about 2 weeks as well. I have not seen it for about a week or so now, so I guess they got what they needed for now.
good point. I guess when I think about it I have seen that kind of behavior before... my site grew rather fast so we had a few of the established search engines having to catch up after they finally decided to index the site.
  #13  
Old 01-30-2008, 12:37 AM
AnthonyCea AnthonyCea is offline
Publisher
 
Join Date: Feb 2006
Posts: 31,531
Twingbot and Accoona are nothing to worry about; at least they identify themselves. What you really have to worry about are content scrapers and spam bot nets, which is why I watch who is online really closely.
  #14  
Old 01-30-2008, 09:40 AM
ScottG ScottG is offline
Member
 
Join Date: Jan 2008
Posts: 15
Quote:
Originally Posted by Sykko View Post
as it is I have been quite confused as to why this bot is constantly on my forum browsing around (mostly at open directories)

I had never heard of accoona or of twingbot... I wasn't incredibly worried because accoona.com led me to the search engine so I figured this was just a bot... but it's still acting a bit odd on my site so it would have been nice to know that it's a new product that you are working on. had the IP resolved to twingbot.com I would have visited that site instead of going to accoona...
Couple of items...

* When you say "odd," could you be more specific? If the bot is doing something strange, I'd like to know what it is if you could spare a moment. Its default behavior is to get as much as it can in one connection, without abusing the privilege, then come back. (Basically decreasing the total number of connections required to gather info.)

* As far as the IP goes... yes, we know. We should really work on the DNS to get it to resolve differently. It's just one of those finding-the-time-to-do-it things; it was simply faster and easier to hitchhike on existing data center infrastructure. We figured people would find twingbot.com easily enough given the string in the User-Agent.

Scott
  #15  
Old 01-30-2008, 09:48 AM
AnthonyCea AnthonyCea is offline
Publisher
 
Join Date: Feb 2006
Posts: 31,531
Scott, there is a major epidemic of spam bot nets and content scrapers that has webmasters highly paranoid. Your best bet is to quickly publish a great spider identification page with all the IP ranges you crawl from; that way webmasters will allow you to crawl.

Most will ban even legitimate bots nowadays unless they can quickly identify what you are doing from a link in your user agent.

We had a major problem with SearchMeBot when they were switching data centers. We had to figure out for ourselves that they were using different host names and data centers, and we had banned one of their DCs and its IPs as a content scraper bot because the host names did not resolve properly and there was no information available.

Lack of ID will kill a new search engine before it starts.
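The host name check discussed above is often done as a forward-confirmed reverse DNS lookup: resolve the IP to a host name, check the host name's domain, then resolve that host name back and confirm it returns the same IP. The following is a minimal sketch of that technique using the standard `socket` module; the function name and the injectable `reverse` and `forward` parameters are illustrative, and the demo uses stand-in resolvers (it makes no claim about what a live lookup returns today).

```python
import socket


def forward_confirmed_rdns(ip, expected_suffixes, reverse=None, forward=None):
    """Return True if ip reverse-resolves into an expected domain and that
    host name resolves forward to the same ip (IPv4 only in this sketch)."""
    reverse = reverse or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward = forward or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        hostname = reverse(ip).rstrip(".").lower()
        in_domain = any(
            hostname == suffix or hostname.endswith("." + suffix)
            for suffix in expected_suffixes
        )
        return in_domain and ip in forward(hostname)
    except OSError:  # covers socket.herror and socket.gaierror
        return False


# Stand-in resolvers so the demo runs offline. The host name is the one
# quoted in the first post; the forward answer is assumed for illustration.
def fake_reverse(addr):
    return "pfw002-g2-netb.si.it.accoona.net"


def fake_forward(host):
    return ["208.84.132.11"]


genuine = forward_confirmed_rdns("208.84.132.11", ["accoona.net"], fake_reverse, fake_forward)
wrong_domain = forward_confirmed_rdns("208.84.132.11", ["twingbot.com"], fake_reverse, fake_forward)
spoofed_ptr = forward_confirmed_rdns("203.0.113.5", ["accoona.net"], fake_reverse, fake_forward)
print(genuine, wrong_domain, spoofed_ptr)
```

The `wrong_domain` case mirrors the confusion in this thread: a webmaster expecting twingbot.com would reject a crawler whose host name resolves under accoona.net unless the operator documents the relationship.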
  #16  
Old 01-30-2008, 09:53 AM
Sykko
Guest
 
Posts: n/a
the biggest thing that I find odd is that it has been connected non-stop for the last couple-few days on my site. almost always it is focused on forum indexes (as opposed to individual threads)

if all it was doing was indexing my forum index pages then it should be done by now I would think. unless it is an unusually slow bot.

strangely enough it seems to revisit forums that it has viewed in the past...

(I know you already understand how a forum works but I need to lay it out again so that I can explain what the bot is doing)

for example say my forum has the following categories

Chat
Suggestions
Questions

and each section has threads

chat
--hello
--how ya doing
suggestions
--install something
--new skin
questions
--what's this site about?
--where are you?

and then each thread has posts in it... which are all listed on a single page

ok so now on this hypothetical forum... the bot starts off viewing "chat" and spends quite a bit of time sorting through pages I assume.

after a while (not sure how long... I can watch it and give you an idea if you would like) it will move on to "suggestions"

and will sit there for a while... being active enough to show times updating in the who's online box.

then it might look at a thread (so far it's only looked at very large threads on my site... its favorite is a thread that has 8500 posts in it) but then it will move on to questions...

then I'll get distracted and come back later and find that it is back on chat again...
  #17  
Old 01-30-2008, 03:35 PM
ScottG ScottG is offline
Member
 
Join Date: Jan 2008
Posts: 15
Got it.

What you're seeing is the 'bot re-visiting main forum pages to get a sense of new activity and then seek out those new pages. And due to the dynamic nature of forums, we do try to get back to visit a lot.

Assuming we do well with our new product, ideally this will help all forum owners. Yes, you guessed it... Twing.com is basically a new search engine specific to online communities. It's still 'officially' in beta, though it's now out on the web. We'll be enhancing a variety of things over the next several weeks.

We believe that we can provide a ton of value in helping to surface community content and get people to boards and discussions more effectively than generalized search engines. At some point, I'll also put a blog post out there for Forum Owners that visit and point them to resources like this site so they can get more help. (We'll have our own little forums, but basically, we're going to be forum search. We're not looking to compete with the forum admin sites. Rather the opposite. We'll probably be buying some ads on them and such. Not so much in anticipation of getting a ton of traffic, but more to support the community.)

Scott
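The revisit behavior described in this post (re-read the forum index, then fetch only what changed) can be illustrated with a small snapshot comparison. This is a sketch of the general idea only, not Twingbot's actual implementation; the function name, the snapshot format (thread id mapped to reply count) and the sample thread ids borrowed from the hypothetical forum earlier in the thread are all assumptions.

```python
def threads_needing_fetch(previous, current):
    """Compare two snapshots of a forum index, each a dict of
    {thread_id: reply_count}, and return the ids of threads that are
    new or whose reply count changed, in sorted order."""
    return sorted(
        thread_id
        for thread_id, replies in current.items()
        if previous.get(thread_id) != replies
    )


# Snapshot from the last visit and from the current visit (illustrative data).
last_visit = {"hello": 3, "how-ya-doing": 10}
this_visit = {"hello": 3, "how-ya-doing": 12, "new-skin": 0}

to_fetch = threads_needing_fetch(last_visit, this_visit)
print(to_fetch)
```

Under this scheme a crawler that keeps returning to index pages is expected, while repeated fetches of an unchanged thread would suggest a bug.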
  #18  
Old 01-30-2008, 04:13 PM
Sykko
Guest
 
Posts: n/a
Quote:
Originally Posted by ScottG View Post
Got it.

What you're seeing is the 'bot re-visiting main forum pages to get a sense of new activity and then seek out those new pages. And due to the dynamic nature of forums, we do try to get back to visit a lot.

Assuming we do well with our new product, ideally this will help all forum owners. Yes, you guessed it... Twing.com is basically a new search engine specific to online communities. It's still 'officially' in beta, though it's now out on the web. We'll be enhancing a variety of things over the next several weeks.

We believe that we can provide a ton of value in helping to surface community content and get people to boards and discussions more effectively than generalized search engines. At some point, I'll also put a blog post out there for Forum Owners that visit and point them to resources like this site so they can get more help. (We'll have our own little forums, but basically, we're going to be forum search. We're not looking to compete with the forum admin sites. Rather the opposite. We'll probably be buying some ads on them and such. Not so much in anticipation of getting a ton of traffic, but more to support the community.)

Scott
well, I hadn't guessed it but I admit I am very excited to hear about such a search engine! it sounds great!

I definitely look forward to seeing it... and since you say there isn't really anything funny going on with your bot I will just let it do what it does
  #19  
Old 02-01-2008, 08:17 AM
AnthonyCea AnthonyCea is offline
Publisher
 
Join Date: Feb 2006
Posts: 31,531
One thing you need to look at, Scott, is that Twingbot has been stuck on this one thread for the last few days here on ForumPostersUnion.com.

08:14 AM Guest Viewing Thread
Hackers probing forums for security holes 208.84.132.11
Twingbot/1.0 (+http://www.twingbot.com/)
  #20  
Old 02-01-2008, 10:49 AM
ScottG ScottG is offline
Member
 
Join Date: Jan 2008
Posts: 15
Sending info to developer now. Will let you know outcome.
  #21  
Old 02-04-2008, 12:57 AM
Sykko
Guest
 
Posts: n/a
it's still stuck on my main forum indexes btw...

it doesn't even bother to look at threads...

I am thinking it's getting clogged at some point...

y'all might want to look at that as well...
  #22  
Old 02-05-2008, 04:12 PM
ScottG ScottG is offline
Member
 
Join Date: Jan 2008
Posts: 15
Updates:

* The page for Twingbot.com has been updated per Anthony's suggestions to include the IP for the 'bot, as well as some additional information.

* The 'bot should not be sticking on index pages only. We may have gone back to check for new stuff, but it shouldn't be hammering away to the point where it's doing anything more than that. If you're still seeing this, please let me know whether you would mind our developer talking to you directly, if necessary, to help us fix it.

Thanks,
Scott
  #23  
Old 02-05-2008, 07:02 PM
AnthonyCea AnthonyCea is offline
Publisher
 
Join Date: Feb 2006
Posts: 31,531
Thanks Scott for adding the IP addresses you use to crawl with TwingBot.

Many new search engines are adding this data to their spider ID page and linking to it in their user agent so webmasters are not left in the dark being forced to investigate the matter themselves. Many will simply ban new bots if this data is not supplied, so it is very good policy to be as transparent as possible.

Another problem we see every day is spider host names that resolve not to the actual search engine but to some data center it may be using. In many cases this will also cause webmasters to ban a bot's IP range.

I will post some of the hits in this thread to show where your bot is going on this forum.
  #24  
Old 02-05-2008, 08:13 PM
AnthonyCea AnthonyCea is offline
Publisher
 
Join Date: Feb 2006
Posts: 31,531
Right now Twingbot still seems to be stuck on this one thread.

08:08 PM Guest Viewing Thread
Hackers probing forums for security holes 208.84.132.11
Twingbot/1.0 (+http://www.twingbot.com/)
  #25  
Old 02-06-2008, 01:15 AM
AnthonyCea AnthonyCea is offline
Publisher
 
Join Date: Feb 2006
Posts: 31,531
01:12 AM Guest Viewing Index
Forum Posters Union 208.84.132.11
Twingbot/1.0 (+http://www.twingbot.com/)

01:16 AM Guest Viewing Forum
Search Engine Intelligence 208.84.132.11
Twingbot/1.0 (+http://www.twingbot.com/)

07:26 AM Guest Viewing Forum
Spiders & Crawlers 208.84.132.11
Twingbot/1.0 (+http://www.twingbot.com/)

07:51 AM Guest Viewing Thread
Do you own a website or a blog 208.84.132.11
Twingbot/1.0 (+http://www.twingbot.com/)

08:17 AM Guest Viewing Thread
Hackers probing forums for security holes 208.84.132.11
Twingbot/1.0 (+http://www.twingbot.com/)
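Spotting a crawler that keeps returning to one thread, as reported in this and the two earlier log posts, amounts to counting repeat views per target. Here is a minimal sketch, assuming the who's-online entries have already been reduced to (kind, title) pairs; the function name and the threshold of 3 are arbitrary choices for illustration.

```python
from collections import Counter

# (kind, title) pairs transcribed from the Twingbot entries posted in this
# thread: the two single sightings first, then the five entries above.
entries = [
    ("Thread", "Hackers probing forums for security holes"),
    ("Thread", "Hackers probing forums for security holes"),
    ("Index", "Forum Posters Union"),
    ("Forum", "Search Engine Intelligence"),
    ("Forum", "Spiders & Crawlers"),
    ("Thread", "Do you own a website or a blog"),
    ("Thread", "Hackers probing forums for security holes"),
]


def repeatedly_viewed(log_entries, threshold=3):
    """Return thread titles seen at least `threshold` times, in first-seen order."""
    counts = Counter(title for kind, title in log_entries if kind == "Thread")
    return [title for title, seen in counts.items() if seen >= threshold]


stuck_on = repeatedly_viewed(entries)
print(stuck_on)
```

Index and forum listing views are deliberately excluded, since a forum crawler is expected to revisit those to look for new activity.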
  #26  
Old 02-06-2008, 09:17 AM
ScottG ScottG is offline
Member
 
Join Date: Jan 2008
Posts: 15
Quote:
Originally Posted by AnthonyCea View Post
01:12 AM Guest Viewing Index
Forum Posters Union 208.84.132.11
Twingbot/1.0 (+http://www.twingbot.com/)

etc.
Still looking into this. In the meantime, we've stopped crawling this board until we can work this out.

Scott
  #27  
Old 03-27-2008, 05:58 AM
Sherrie Sherrie is offline
Super Member
 
Join Date: Mar 2008
Posts: 4
I have banned your bot. It appears to have used 6.5 GB of my bandwidth in a short space of time. I am not impressed...
  #28  
Old 03-27-2008, 08:54 AM
ScottG ScottG is offline
Member
 
Join Date: Jan 2008
Posts: 15
Quote:
Originally Posted by Sherrie View Post
I have banned your bot. It appears to have used 6.5 GB of my bandwidth in a short space of time. I am not impressed...
It was likely collecting history, but it's still supposed to be well behaved. Sometimes, if it's a brand new board to us, it will try to get as much as it can on a per-visit basis, rather than keep hitting with multiple requests. What do you consider a "short space of time"? I can have it set differently. Can you tell me what your board is?

We'll fix it. We don't want anyone bailing out because of any bad behavior on our part. It's in both our interests to have your stuff indexed in our engine. Obviously, this helps us grow the content available to our users and also potentially helps get you traffic. In any case, we respect robots.txt, but I can have your board pulled out of the crawl list altogether if that's what you want.

Sorry for any trouble.

Scott
Director, Product Management
Twing.com
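Since the post above says the bot respects robots.txt, a board owner can check what a given robots.txt would permit using Python's standard `urllib.robotparser`. This is a sketch under assumptions: the user-agent token "Twingbot" is inferred from the user agent string quoted in this thread, and the rules and the www.example.com URL are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that keeps Twingbot out of /forum/ while leaving
# every other crawler unrestricted.
ROBOTS_TXT = """\
User-agent: Twingbot
Disallow: /forum/

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "http://www.example.com/forum/index.php"
twingbot_allowed = parser.can_fetch("Twingbot/1.0", url)
other_allowed = parser.can_fetch("SomeOtherBot", url)
print(twingbot_allowed, other_allowed)
```

This only shows how a compliant parser reads the rules; whether a particular crawler honors them can only be confirmed from the server logs.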
  #29  
Old 03-27-2008, 09:00 AM
AnthonyCea AnthonyCea is offline
Publisher
 
Join Date: Feb 2006
Posts: 31,531
It seems Sherrie is mad and has banned your bot, Scott, but we have not banned it on FPU; we have not seen it at all since your prior post.

Have you folks quit crawling this forum because we are making public reports on Twingbot's activity here? Did you fix the problems you discussed, with the bot targeting one URL or getting stuck?
  #30  
Old 03-27-2008, 09:16 AM
ScottG ScottG is offline
Member
 
Join Date: Jan 2008
Posts: 15
Quote:
Originally Posted by AnthonyCea View Post
It seems Sherrie is mad and has banned your bot, Scott,
Yup. That was clear enough. Hopefully, she'll respond with her board info so I can make sure whatever problem may exist gets looked into.

Quote:
Originally Posted by AnthonyCea View Post
but we have not banned it on FPU; we have not seen it at all since your prior post.

Have you folks quit crawling this forum because we are making public reports on Twingbot's activity here? Did you fix the problems you discussed, with the bot targeting one URL or getting stuck?
It's possible FPU didn't get added back in after I had it pulled out when you raised your initial concerns. I'll have that checked and have it put back in, as the 'getting stuck' issue is supposed to have been fixed long ago. We certainly would never do anything punitive. We need a mutually beneficial, symbiotic relationship with board owners. (The only boards we intentionally cut out of our index are those that are very clearly 1) abandoned 'ghost' boards, 2) pure spam, or 3) primarily hate speech or oriented toward illegal activity. Adult-oriented boards are allowed, though they are tagged so users can choose not to have them in results.)

Clearly, I personally try to keep up with any news, bad or otherwise, going on. But for you guys, I should probably give you my direct Email/Phone in case you notice anything you really need to discuss immediately.

I'm getting a ton of traction in the blog and print media about our product. But my interest isn't just Twing.com. I think the whole forum space has been left out of a lot of the buzz. All the 'sexy' stuff like social networks and blogs gets all the news, whereas I think that, since well before the consumer web, a forum-style UI has always been the essence of online community. That's the story I'm trying to get out there. We obviously want Twing.com to be a main way for people to discover forums, but part of making that happen means helping everyone really understand what a forum is.

Scott
  #31  
Old 03-27-2008, 09:21 AM
AnthonyCea AnthonyCea is offline
Publisher
 
Join Date: Feb 2006
Posts: 31,531
You can fire off a PM with your contact data.

You are right about forums. Social networks and blogs all have forums, or ways for users to respond in threads the same way forums do, and this has hurt many forums.

At the same time vBulletin has added social networking modules to their script which most forum users will not use because they like the forum posting concept.

vBulletin has also added a blogging module function available as a value added hack or plug in.

vB has also introduced a photo gallery plug in, so they are looking at forum software as more of a portal framework today.
  #32  
Old 03-27-2008, 09:38 AM
ScottG ScottG is offline
Member
 
Join Date: Jan 2008
Posts: 15
Quote:
Originally Posted by AnthonyCea View Post
You are right about forums. Social networks and blogs all have forums, or ways for users to respond in threads the same way forums do, and this has hurt many forums.
Some blogs have added limited threading or otherwise more advanced capabilities beyond basic commenting. Yet I personally find these weak substitutes. In some ways, a blog post may be thought of as just the first post in a forum thread, but it is usually longer-form and more often a one-to-many positioning statement than a question, which is how many forum threads start.

In any case, I continue to believe the basic inherent UI of what we think of as "forums" is simply a more effective means for many-to-many communications. It's perhaps a bit of an intangible thing. But I maintain my assertion that you can put all the portal and enhanced adjuncts around a forum, yet it remains the forum that's the core of the communal experience.

We have to make this undercurrent reality more of a surface level realization in the media that covers such things.

Scott
  #33  
Old 03-27-2008, 09:52 AM
AnthonyCea AnthonyCea is offline
Publisher
 
Join Date: Feb 2006
Posts: 31,531
As I mentioned, many forum users simply ignore all the blog, social network and photo gallery add-ons and plug-in modules anyway.

The majority of forum users come to grab a bit of data, many never even post on the forum and come just to read.
  #34  
Old 03-27-2008, 02:31 PM
Sherrie Sherrie is offline
Super Member
 
Join Date: Mar 2008
Posts: 4
Quote:
Originally Posted by ScottG View Post
It was likely collecting history, but it's still supposed to be well behaved. Sometimes, if it's a brand new board to us, it will try to get as much as it can on a per-visit basis, rather than keep hitting with multiple requests. What do you consider a "short space of time"? I can have it set differently. Can you tell me what your board is?

We'll fix it. We don't want anyone bailing out because of any bad behavior on our part. It's in both our interests to have your stuff indexed in our engine. Obviously, this helps us grow the content available to our users and also potentially helps get you traffic. In any case, we respect robots.txt, but I can have your board pulled out of the crawl list altogether if that's what you want.

Sorry for any trouble.

Scott
Director, Product Management
Twing.com
Hello Scott

11 GB of bandwidth has been used by bots so far this month, and 6.78 GB of that was you. What's more alarming for me is that when I look at my bandwidth charts, the bandwidth from your bot appears to have come from the last 10 days alone. At that rate I will have to upgrade my account and pay double what I pay now.

Almost forgot: http://www.apinchofhealth.com/forum/vbb/index.php
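The figures in this post can be turned into a share and a rough projection. This is a back-of-the-envelope sketch using only the numbers reported above (11 GB by all bots, 6.78 GB attributed to Twingbot, over about 10 days); the 30-day month and the assumption that the crawl rate stays constant are illustrative, and the host's plan limits are not stated in the thread.

```python
# Figures reported in the post above.
all_bots_gb = 11.0
twingbot_gb = 6.78
window_days = 10

# Share of bot bandwidth attributed to Twingbot.
share = twingbot_gb / all_bots_gb

# Average daily draw and a naive 30-day projection at a constant rate.
daily_rate_gb = twingbot_gb / window_days
projected_30_days_gb = daily_rate_gb * 30

print(f"share of bot traffic: {share:.1%}")
print(f"daily rate: {daily_rate_gb:.3f} GB/day")
print(f"30-day projection: {projected_30_days_gb:.2f} GB")
```

That works out to roughly 62% of all bot traffic and about 20 GB over a month if nothing changed, though an initial history crawl would normally taper off, as described elsewhere in the thread.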
  #35  
Old 03-27-2008, 04:51 PM
ScottG ScottG is offline
Member
 
Join Date: Jan 2008
Posts: 15
Quote:
Originally Posted by Sherrie View Post
Hello Scott

11 GB of bandwidth has been used by bots so far this month, and 6.78 GB of that was you. What's more alarming for me is that when I look at my bandwidth charts, the bandwidth from your bot appears to have come from the last 10 days alone. At that rate I will have to upgrade my account and pay double what I pay now.

Almost forgot: http://www.apinchofhealth.com/forum/vbb/index.php
I'm having our developer stop the crawl on your stuff and investigate what happened and why. When he assures me we fully understand the problem and it's fixed, I'll let you know, and hopefully you'll be willing to flip the switch back on. Maybe next month or so I can make it up to you and have our writer do a blog post on healthy forums to check out and put in a link, etc. Or mention apinchofhealth in a magazine interview or something. I should have a couple of them coming up.

Scott
  #36  
Old 03-27-2008, 10:43 PM
Sherrie Sherrie is offline
Super Member
 
Join Date: Mar 2008
Posts: 4
Thank you for understanding Scott. I am happy to reconsider the ban after this current month is over but can't afford the risk until then.
  #37  
Old 04-02-2008, 08:42 PM
Sherrie Sherrie is offline
Super Member
 
Join Date: Mar 2008
Posts: 4
Scott, may I ask you a question? I don't know a great deal about how search engine bots work, and I have noticed you are still crawling my site through another IP, albeit not as aggressively as before.

Here's an example of some stats so far this April for the top 5 search engine bots. Yours (208.84.134.21) is the last one, first set of numbers are hits and the second set is bandwidth:

Yahoo Slurp.......................................1602.......72.56 MB......03 Apr 2008 - 02:19
Googlebot.........................................1501.......73.28 MB......03 Apr 2008 - 02:13
Unknown robot (identified by 'crawl').............1466.......98.40 MB......03 Apr 2008 - 00:04
Unknown robot (identified by 'spider').............849.......74.01 MB......03 Apr 2008 - 01:02
Unknown robot (identified by 'bot/' or 'bot-').....789......196.07 MB......03 Apr 2008 - 02:09

Why is it that Yahoo can use 72 MB of bandwidth despite 1602 hits, yet yours uses 196 MB for only 789 hits? Is it because Yahoo has been indexing my website longer?
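The comparison asked about here is easier to see as bandwidth per hit. The sketch below uses only the hit counts and bandwidth figures from the stats in this post; treating 1 MB as 1024 KB is an assumption about how the stats package counts, and the last row is the generic 'bot/' bucket that the post attributes to Twingbot.

```python
# (label, hits, bandwidth in MB) taken from the stats in the post above.
rows = [
    ("Yahoo Slurp", 1602, 72.56),
    ("Googlebot", 1501, 73.28),
    ("Unknown robot ('crawl')", 1466, 98.40),
    ("Unknown robot ('spider')", 849, 74.01),
    ("Unknown robot ('bot/' or 'bot-')", 789, 196.07),
]

# Average KB transferred per hit for each row (assuming 1 MB = 1024 KB).
kb_per_hit = {label: megabytes * 1024 / hits for label, hits, megabytes in rows}

for label, value in kb_per_hit.items():
    print(f"{label}: {value:.1f} KB per hit")

ratio = kb_per_hit["Unknown robot ('bot/' or 'bot-')"] / kb_per_hit["Yahoo Slurp"]
print(f"ratio to Yahoo Slurp: {ratio:.1f}x")
```

On these numbers the 'bot/' row averages about 254 KB per hit against about 46 KB for Yahoo Slurp, a bit more than five times as much, which is consistent with a crawler fetching large pages in full rather than making many small or conditional requests.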
  #38  
Old 04-02-2008, 09:52 PM
AnthonyCea AnthonyCea is offline
Publisher
 
Join Date: Feb 2006
Posts: 31,531
Sherrie, Scott added the IP he crawls from to his spider identification page, as you can read in this thread.

Is the new IP you are talking about 208.84.134.21? Was it using a Twingbot user agent?
  #39  
Old 04-02-2008, 10:06 PM
ScottG ScottG is offline
Member
 
Join Date: Jan 2008
Posts: 15
Quote:
Originally Posted by Sherrie View Post
Why is it that Yahoo can use 72 MB of bandwidth despite 1602 hits, yet yours uses 196 MB for only 789 hits? Is it because Yahoo has been indexing my website longer?
I'll double-check with the lead dev on the 'bot tomorrow, but I think it has to do with how we've decided to make the best initial crawls. For example, we want to minimize the HTTP GETs made, so we grab as much as we can per visit. It's most efficient for us and for you. But the major draw on content is when we first crawl, that is, to get history. Once we have that, subsequent visits are shorter and just for updates.

This is what happens with any new search engine sending its crawlers out. At some point, I'm going to build a forum owners center that will allow those that allow us in via robots.txt to throttle Twingbot as desired. (Of course, that could have an impact on ability to cover some updates.)

As I mentioned in a prior post, we had your board taken out of the crawl. When I'm fully assured your concerns are addressed, I'll let you know and ask permission for us to have another go. (And I'll make tech watch the hits carefully and maybe set a custom throttle level for your site, etc.)

Scott
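The per-site throttle mentioned in this post could take many forms. One simple form is a minimum delay between requests to the same host, sketched below. This is not Twingbot's design; the class name `HostThrottle`, the 5-second interval and the host names are hypothetical, and the clock and sleep functions are injectable so the demo runs instantly with a fake clock.

```python
import time


class HostThrottle:
    """Enforce a minimum interval between requests to the same host."""

    def __init__(self, min_interval_seconds, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval_seconds
        self.clock = clock
        self.sleep = sleep
        self.last_request = {}  # host -> time the last request was allowed

    def wait(self, host):
        """Sleep if needed before a request to host; return seconds waited."""
        now = self.clock()
        last = self.last_request.get(host)
        waited = 0.0
        if last is not None:
            waited = max(0.0, self.min_interval - (now - last))
            if waited:
                self.sleep(waited)
        self.last_request[host] = now + waited
        return waited


# Demo with a frozen fake clock and a recorded sleep, so nothing blocks.
naps = []
throttle = HostThrottle(5.0, clock=lambda: 100.0, sleep=naps.append)
first = throttle.wait("forum.example")
second = throttle.wait("forum.example")
other_host = throttle.wait("another.example")
print(first, second, other_host, naps)
```

A per-host interval caps the request rate but not the bytes per request, so a bandwidth concern like the one raised here would also need a cap on pages or bytes per visit.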
  #40  
Old 04-02-2008, 10:09 PM
AnthonyCea AnthonyCea is offline
Publisher
 
Join Date: Feb 2006
Posts: 31,531
Scott, if you have added new IPs, did you also add them to the Twingbot information page linked in your user agent?
All times are GMT -5.


2006-2011 ForumPostersUnion.com