Sunday, July 26, 2009

Googling the Right Way (a repost)

Notes

This is a republication of an article I wrote in early 2008, touching lightly upon the topic of Google queries and the mighty search giant's history. I first published the text on my then pet project site, pworks (which by the way now redirects to tenzui), and from there it got mirrored on a couple of technology-focused sites and forums.

Since writing this, both I and the world of search have progressed greatly, and I hope to be able to write a follow-up, more in-depth post on the topic of searching in the future. So if you're interested in that - stay connected. ;)

Short History

Right, this will be a really, really short history lesson. If you're interested, check out for yourself what the people over there have written. (Link at page bottom)

So, Google was created by the duo Larry Page and Sergey Brin, two Stanford grad students who, although they didn't see eye-to-eye on many topics, were determined to crack the quite boring nut of organizing all the information spread out across the web. By 1997, their BackRub search engine had started gaining a sparkling reputation for its unique way of analyzing and ranking webpages through "back links", that is, links pointing to a page from other pages. The system also gained attention for its interesting server environment: contrary to the "normal" high-end servers, BackRub ran on a collection of simpler PCs, collected from the campus' nooks and crannies.

From there, the story is one of unfathomable success ("Instead of discussing all the details, why don't I just write you a check?"), leading to the status of The One Search-engine we all know, love and envy.

PageRank

"Back links?" you think. Yeah, Google's system of deciding what pages are worth your reading-cycles differed from the way all other search engines did it at the time. The PageRank algorithm ranks all sites by giving them a rank between 0 and 10, based on how many other pages are linking to the site, and what value the linking pages have.
If you are interested in the mathematics behind the PageRank algorithm, I suggest you read about it on Wikipedia. The logic behind PR is outside the scope of this article.
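
For the curious, the core idea can still be sketched in a few lines of Python. This is a simplified power-iteration version of the published PageRank formula, not Google's actual implementation; the damping factor of 0.85, the tiny three-page graph and the function name pagerank are all assumptions made for illustration. Note that the scores it produces are probabilities that sum to one, not the 0-to-10 scale mentioned above.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration.

    links maps each page to the list of pages it links to.
    Every linked page must also appear as a key.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page starts each round with the small "random jump" share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            if outgoing:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical three-page web: both "a" and "b" link to "hub".
example = {"a": ["hub"], "b": ["hub"], "hub": ["a"]}
ranks = pagerank(example)
```

In this toy graph "hub" ends up ranked highest because two pages link to it, and "a" outranks "b" because its single back link comes from the highly ranked "hub".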

From this information, you can probably figure out the basics of SEO, Search Engine Optimization: get your page linked to by the big boys. Of course, some people just can't be content with playing by the rules, and the PR algorithm isn't perfect, so from time to time someone manages to fool it. One example is the 302 Google Jack, which redirects a new, zero-ranked page to a rank-ten page, like Google itself. When Google updates the PageRanks, the new page gets the same rank as the page it redirects to. Other people buy and sell high-valued links, really a kind of advertising, but with a big debate buzzing in the background. Google has requested that such links use the HTML attribute rel="nofollow", which tells Google to ignore the link when re-ranking the linked page.

The above-mentioned kinds of tricks, as well as many others, can of course get your page devalued, meaning that it will not be ranked at all. Play safe!

Basics

Every Joe Schmoe knows that search engines like Google don't take kindly to long sentences and free text, but he probably never bothered reading up on how the magical search box actually works, something he should be severely punished for. Let's leave Joe to his fate, and rise far above him, to the lands without stupid questions.
Even in the "basic" syntax collection I'm sure you are able to find a few sparkling gems you didn't know about, so skim through it even if you feel confident in your Google-Fu.

So, top down, a standalone word yields pages containing that word, a sentence enclosed with quotation-marks (" ") similarly yields pages that contain that exact phrase. If you have ever created an SQL-query for some database, I'm sure you will find a lot of similarities as we go on now. Google is actually "just a database", remember?

Command               Example                             Result

AND [&] (ampersand)   Slackware AND Linux                 Shows pages containing both arguments. *Note*: this is the default operator, no need to include it
OR [|] (pipe)         laptop OR Desktop                   Shows pages containing either argument
- (minus)             Hamburger -McDonalds                Shows pages containing the word "Hamburger", but only if they don't mention "McDonalds"
+ (plus)              +coke                               Contrary to the "includes" belief, this limits the results to the given form only, no plurals or other tenses
~ (tilde)             ~Hacker                             Results include everything deemed similar to "Hacker"
* (asterisk)          Fish * Chips                        The wildcard (*) is replaced by one or more words/characters (and, n, 'n, &)
define:               define:Nocturnal                    A personal favorite, looks up the meaning of the word
site:                 Phreaking site:phrack.org           Limits the search to a specific site
#..#                  zeroday 2007..2008                  Search results include a value within the given range
info:                 info:www.hacktivismo.com            Shows information about the site
related:              related:www.google.com              Shows pages similar/related to argument
link:                 link:www.darkmindz.com              Shows sites linking to the argument
filetype:             phrack filetype:pdf                 Results are limited to given filetype
([?])                 Cyber (China & America)             Nesting combines several terms in the same query
[?A] in [?B]          1 dollar in yen                     Converts argument A to argument B
daterange:            daterange:2452122-2452234           Results are within the specified date range. Dates are given as Julian day numbers
movie:                movie:Hackers                       Movie reviews, can also find movie theaters running the movie in U.S. cities
music:                music:"Weird Al"                    Hits relate to music
stock:                stock: goog                         Returns stock information (NYSE, NASDAQ, AMEX)
time:                 time: Stockholm                     Shows the current time in requested city
safesearch:           safesearch: teen                    Excludes pornography
allinanchor:          allinanchor: "Best webcomic ever"   Other pages use the argument as link text when linking to the results
inanchor:             foo bar inanchor:jargon             As above, but applies only to the term directly after the operator. The same goes for the pairs below
allintext:            allintext:8-bit music               Argument exists in text
intext:
allintitle:           allintitle: Portfolio               Argument exists in title
intitle:
allinurl:             allinurl:albino sheep               Argument exists in URL
inurl:
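
The daterange: operator takes Julian day numbers (a running count of days) rather than ordinary dates, which makes it awkward to type by hand. A minimal sketch of the conversion in Python, relying on the fixed offset between Python's date ordinals and Julian day numbers; the helper names julian_day and daterange are made up for this example.

```python
from datetime import date

# Offset between Python's date.toordinal() and the Julian day number.
JDN_OFFSET = 1721425


def julian_day(d):
    """Return the Julian day number for a calendar date."""
    return d.toordinal() + JDN_OFFSET


def daterange(start, end):
    """Build a daterange: search term covering start to end, inclusive."""
    return "daterange:{}-{}".format(julian_day(start), julian_day(end))


print(daterange(date(2008, 1, 1), date(2008, 1, 31)))
```

The printed term can be pasted after any ordinary query, just like the other operators in the table.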

Advanced
GET-variable breakdown
http://www.google.com/search?
as_q=test (query string)
&hl=en (language)
&num=10 (number of results [ 10,20,30,50,100 ])
&btnG=Google+Search
&as_epq= (complete phrase)
&as_oq= (at least one)
&as_eq= (excluding)
&lr= (language results. [ lang_countrycode ])
&as_ft=i (filetype include or exclude. [i,e])
&as_filetype= (filetype extension)
&as_qdr=all (date [ all,m3,m6,y ])
&as_nlo= (number range, low)
&as_nhi= (number range, high)
&as_occt=any (terms occur [ any,title,body,url,links ])
&as_dt=i (restrict by domain [ i,e ])
&as_sitesearch= (restrict by [ site ])
&as_rights= (usage rights [ cc_publicdomain, cc_attribute, cc_sharealike, cc_noncommercial, cc_nonderived ])
&safe=images (safesearch [ safe=on,images=off ])
&as_rq= (similar pages)
&as_lq= (pages that link)
&as_qdr= (get only recently updated pages d[ i ] | w[ i ] | y[ i ])
&gl=us (country)
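
The variables above can also be put together programmatically. A rough sketch in Python using the standard library's urllib.parse module; the helper name build_search_url and the chosen defaults are my own assumptions, and Google may of course change or ignore any of these parameters.

```python
from urllib.parse import urlencode

BASE_URL = "http://www.google.com/search"


def build_search_url(query, **params):
    """Build a search URL from a query and extra GET variables.

    Parameter names are taken from the breakdown above; nothing
    here checks that Google still honours them.
    """
    fields = {"as_q": query, "hl": "en", "num": 10}
    fields.update(params)
    # urlencode escapes spaces and special characters for us.
    return BASE_URL + "?" + urlencode(fields)


# "spam" plus the exact phrase "honeypot captcha", PDF files only, 20 results.
url = build_search_url(
    "spam",
    as_epq="honeypot captcha",
    as_ft="i",
    as_filetype="pdf",
    num=20,
)
print(url)
```

Building the URL from a dictionary keeps the escaping in one place, so phrases with spaces or ampersands don't break the query string.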

Googledorks

So, Google gives us all those handy tools for filtering away what we don't want to see. How can we use this to help secure our own systems?

Well, for example, we could use the neat Google Hacking Database, a project where people have submitted a huge collection of queries yielding results that the unskilled webmaster (the Googledork) wishes weren't there. Everything from vulnerable login forms to passwords surfaces with some cleverly engineered queries.

Goolag

Goolag is a vulnerability scanner (and a politically involved protest) made by the famous Cult of the Dead Cow. It builds on the above-mentioned GHDB, scanning for the vulnerabilities listed in the database. At the moment there is only a Windows version of the program. The Goolag project is also a campaign against Google's (and a few other big players') choice to comply with the Chinese censorship policy.

Useful Queries

-inurl:htm -inurl:html intitle:"index of" "Last modified" mp3
    mp3-file indexes, add desired artist
site:rapidshare.de -filetype:zip OR rar daterange:2453402-2453412
    zip files on rapidshare uploaded on specified date
http://www.google.com/search?q=your+query+here&as_qdr=d1
    Query results updated within one day

Others

http://www.google.com/search?q=answer to life, the universe, and everything
http://www.churchofgoogle.org
http://www.google.com/technology/pigeonrank.html

References
http://www.google.com/help/cheatsheet.html
http://www.dumblittleman.com/2007/06/20-tips-for-more-efficient-google.html
http://www.googleguide.com/advanced_operators_reference.html
http://sudarmuthu.com/blog/2006/05/07/google-search-syntax-dissected.html
http://en.wikipedia.org/wiki/PageRank
http://johnny.ihackstuff.com/

Monday, July 20, 2009

Spam vs. CAPTCHA, the lesser of two evils

For quite a while now, one of the greatest annoyances I've encountered on the net has been something we've come
to accept as "the lesser of two evils": a spambot roadblock known as "CAPTCHA." (The acronym actually
has a meaning, which is "Completely Automated Public Turing test to tell Computers and Humans Apart.")

Now, you might ask me, "So what, you fool? Would you prefer getting every one of your forms exploited by spambots?"

Of course not. There is nothing I despise more than getting countless well-meaning offers of masculine-organ-gargantuafication. (And that is not entirely because rendering those areas any larger would be more of a nuisance than anything else... (Bad puns end here. (Nesting ftw!)))

As much as I want to avoid those mails, I can't help feeling a great irritation every time an incomprehensible image pops up,
declaring me a fifty-line script for not realizing that S was actually a 5. More than once, this frustration has lost a forum or blog
a comment from yours truly, and probably many more from others.

When talking about these matters in a corporate fashion, you use the term "conversion ratio." Simply put, it's the percentage of visitors
that actually follow through with the action that you as the author wish for them to take, whether that is filling out a form, signing up as a member, or perhaps purchasing a certain product or service. And, as you've probably figured out by now, the use of CAPTCHAs might have a negative impact on this ratio.

At least, that was what a recent post on the SEOmoz.org blog was all about. The author of this post put together some very clear and impressive statistics, showing that the use of CAPTCHAs yielded an 88% reduction in spam, but at the same time the number of failed "conversions" rose drastically. And the amount of spam was not that great to begin with.
[You can read the full, very interesting post here: http://www.seomoz.org/blog/captchas-affect-on-conversion-rates]

So, when putting the conversion ratio first, not implementing a CAPTCHA seems to yield more favorable results. But really, we do not want that spam!

The same post as mentioned above provided a link to a soon three-year-old alternative solution to the problem - called the "Honeypot CAPTCHA."
The general idea of this solution is that, when a spambot traverses your page, it looks for and attacks any tasty-looking form, but rarely ever pays any attention to user-oriented code, that is, the stylesheet. So, what if we put a field in our form that code-wise appears as a completely normal input field, but is invisible to the real user? Get it? If that field, which a real user wouldn't fill out, actually *is* filled out, we can deduce that this was the work of something less intelligent: a couple of dirty lines of code. I've included a simple example piece of code in the final part of this post.
[The blog in which this solution, as well as two other interesting ones were originally posted can be found here: http://haacked.com/archive/2007/09/11/honeypot-captcha.aspx]

Opinions voiced against this method primarily concern the very important matter of accessibility - accessing a form with a field like this using a screen reader or text-based browser could confuse a valid user or leave them unable to use the form. However, supplying a proper explanation alongside the field should solve this matter. And also, how *does* a screen reader or text browser go about regular CAPTCHAs, anyways?

But facing the cold, hard facts, we can't fool ourselves into believing that spambots will stay silly forever. In fact, there should already be quite a few sophisticated ones out there. The battle against spam has been raging since the olden days, and just to provide an example I'd like to toss in a link to this very informative post by an anti-spam software developer, written in early '06. [Go ahead and read: http://unknowngenius.com/blog/archives/2006/01/30/the-state-of-spam-karma.] He discontinued working on his project, SpamKarma2, in mid-'08, and put the code up on Google Code under a standard GPLv2 license, where it's still being developed today.

Back to the point - he points out in the post I linked to above that he had already then observed an increase in spambot efficiency: bots making their access look more human-like, following links in a "common" manner, and even bypassing JavaScript filters. A programmer who can implement a JavaScript parser in his spambot would hardly be challenged to create one for stylesheets as well; the reason there haven't been any indications of one yet is simply that there hasn't been any need for it. Thus, the honeypot solution, if widely spread, would probably be surmounted with relative ease.

If I haven't frustrated you enough yet, breaking all the good parts of the "solution" before you've even had a chance to code it into your site, here's one more: "OCR." Utilizing this technique, invented to turn scanned images into normal text, the quite famous XRumer bot was able to break Hotmail and Gmail CAPTCHAs in late '08. So the race is, by all measures, a tight one. Obfuscated CAPTCHAs, however, still seem to hold pretty high ground, and thus they are indeed the most effective way to avoid spam. But (back to square one) they are user-unfriendly and perhaps have a negative commercial impact.

So to sum things up:
  1. Using the honeypot CAPTCHA and common sense, a "low-value" target would probably be able to avoid practically all spam without implementing intrusive techniques such as regular, hard-to-OCR CAPTCHAs.
  2. For the best possible security, a hard-to-OCR CAPTCHA is the way to go, unfortunate but true. One nice system I'd like to push for is the reCAPTCHA service, which turns the pestering work into a good deed by using your human processing cycles to digitize old books and publications. [For more information on this, visit http://recaptcha.net]
  3. The battle rages on. If you've got any information regarding this topic I'd more than love to hear from you. Especially if you hold some information about the workings of more sophisticated spambots. Ignorance might be bliss, but living in the grey-zone in between is pure hell.

Thanks for sticking through, hope you found this somewhat useful.

------------------------------------------
Honeypot CAPTCHA simple example:

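<style>
  /* Hide the decoy field from human visitors */
  #letshidethis { display: none; }
</style>

<form>
  <input ...>
  <input ...>
  <textarea ...></textarea>
  <div id="letshidethis">
    <!-- decoy field; use user_info or some other tasty-looking faux name -->
    <input name="user_info" ...>
  </div>
  <input type="submit">
</form>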

Then in your code, you would simply check if user_info contains any data. If it does, it might very well be spam.

One thing to note here: don't give the fake input a completely unintelligible name, since some (many?) spambots seemingly look for a collection of known names to post into.
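
The server-side check could look something like the following. This is a minimal sketch in Python, assuming the submitted form has already been parsed into a dictionary by whatever framework you use; the function name looks_like_spam is hypothetical, and user_info is simply the fake field from the example above.

```python
HONEYPOT_FIELD = "user_info"  # name of the hidden fake input


def looks_like_spam(form):
    """Return True if the hidden honeypot field was filled in.

    form is a dict of submitted field names to string values.
    A real visitor never sees the field, so it should be empty
    or missing entirely.
    """
    value = form.get(HONEYPOT_FIELD, "")
    return value.strip() != ""


# Two made-up submissions: a human leaves the hidden field empty,
# a bot dutifully fills in everything it finds.
human = {"name": "Alice", "comment": "Nice post!", "user_info": ""}
bot = {"name": "x", "comment": "Buy now", "user_info": "http://spam.example"}

print(looks_like_spam(human))
print(looks_like_spam(bot))
```

A filled-in honeypot only suggests spam, so it may be wiser to hold such submissions for review than to discard them silently.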

Saturday, July 11, 2009

A venture into the depths of the Macbook's sleeping habits

I have a problem with my sleeping habits, you see, so I decided to put a magnet on the reed switch attached to my battery cable. Wait, what?

Correction: my Macbook has problems, and they're even graver than mine. First of all it's missing a pretty major part if you happen to be a laptop, and that's its screen. Second of all the trackpad isn't working and a key is loose (a key I happen to use quite often, too). That's okay though, all I have to do is use an external monitor and mouse, right? Right.

As I found out the other day, the developers over at Apple, in a flash of genius, decided to disagree with me when they made the OS X 10.5 Leopard installer. If you unplug or break the Macbook display, the computer doesn't realize that the screen isn't there and tries to use it, but you can just connect an external monitor, use it as the main display, and it's all fine. That is, as long as you can put up with the broken/unplugged display sitting on either side of your external display without you being able to see it (I set it to a low resolution and put it in the middle of the left side, so I won't wander onto it when I try to use the hot corners). The installer unfortunately doesn't have the luxury of being able to mirror or use an external display as the main display, even if the built-in one is broken or unplugged.

There is a solution, however, I realized while searching the web for any possible keyboard shortcuts or other keys to press at boot (there aren't any, by the way). The solution is to make the computer sleep and keep the switch that makes the computer sleep on while waking it up with an external keyboard. Anyone who has used a Mac laptop with its lid closed can attest to the fact that you can use it while closed if you've got an external keyboard, mouse and display, and Apple confirms it on their site.

I've removed the display part entirely, however, so how do I make it sleep? Well, I figured I might as well try to find out what kind of switch does the magic and after some googling I found the name "reed switch." A reed switch is a switch activated by magnets, so I figured I'd just run a magnet all around and try to find it (since there are pretty strong magnets in the display pretty close to the hard drive already I figured it'd be okay, was a bit anxious at first but it passed ;)).

Nothing happened. I tried googling again and after a lot of different queries and reading I managed to find a page that sold a battery connector spare part. With a reed switch attached to it? I tried running the magnet around the area where it should have been, but got no result. Ah, what the heck, I've been in there before so why not again.


At this point my Macbook had about 30 screws loose. ;)


The guts of a Macbook, just in case you haven't seen such gory pictures before. :)

I opened it up and located the PCB where the reed switch is supposed to be. I disconnected it and put it back in, and after moving it around a bit I managed to get it to work and voilà, it went to sleep!

A strong magnet (from the screen part of the Macbook, actually), wrapped in paper tape, on top of the reed switch.

All I had to do now was boot up the installer with the computer in "fake sleep mode" and to my amazement the installer showed up on the external monitor! On a side note you don't actually need an external keyboard, all you need is a keyboard. Normally you need an external one because you can't reach the built-in one, the obvious reason being that the display is in the way. If you haven't got a display the built-in keyboard works just fine, though.


The OS X 10.5 Leopard installer on an external display. :)

In the end it installed just fine, and now I run it in "sleep" mode all the time to get rid of the internal non-existing display. I taped the magnet to the side of the computer after making sure the magnetism affected the switch enough from where it was.


It looks quite neat being taped to the side like that, it's even got a similar color. :)

Though completely unrelated to what I do for a living, I thought it was interesting enough to put in a blog. I hope this will help someone else with a similar problem (there seemed to be quite a few judging from my google searches). :)

I'll end this post with a picture of where this took place - my working desk. There is no better place than in front of a sturdy, worn-down desk surrounded by computers and various parts and electronics. And knives. That's an essential tool when disassembling computers, trust me.


Thursday, July 9, 2009

The Blogging Begins

After a night of development and pixel-perfection, we've made a lot of progress on our company site as well as our business-card design. Within the next couple of days, we hope that the site will be operational and the cards in print, and with that, we also decided it was time to set up a company blog, in which we could document interesting events, solutions, problems, bugs etcetera.

If you've found your way here, we bid you welcome.

This won't be a very frequently updated blog, at least not within the foreseeable future, but we promise to keep the content that actually will be posted both interesting and rich in information.

If you find any of the content (yet to be posted) in this blog helpful, we'd be very happy if you would spare a few seconds to comment - letting us know our ramblings are of some use. Most likely, that would spur us into writing a few lines more.

And so now, at the break of dawn, it is time to recharge the batteries for yet another day with a dense task-list.

In the future, you can look forward to an exploration of the macbook 1st-gen magnetic lid sensor, and perhaps a few lines of CSS-solutions that just might save the day, freeing you from that awful rendering inconsistency between browsers that just won't go away.

Good night.