Third, remove stop words: in everyday language we often use interjections and auxiliary words that add nothing to the meaning, and the same is true on the Internet. In both Chinese and English, some words occur very frequently yet have no real effect on the content. Common examples are auxiliary particles, interjections such as "ah" and "ha", and adverbs and prepositions such as "but". Search engines refer to these words without substantive meaning as stop words. A search engine removes stop words when it processes the pages it has crawled, which makes the topic of the page more prominent and also reduces the amount of computation.
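As a rough illustration of the idea, stop-word removal can be sketched as a simple filter over a token list. The `STOP_WORDS` set and the `remove_stop_words` function below are hypothetical names chosen for this example, and the word list is a tiny made-up sample, not the list any real search engine uses.

```python
# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "is", "and", "but", "ah", "ha"}


def remove_stop_words(tokens):
    """Return the tokens that are not in the stop-word list.

    Matching is case-insensitive; the original casing of kept tokens
    is preserved.
    """
    return [token for token in tokens if token.lower() not in STOP_WORDS]


if __name__ == "__main__":
    words = "The theme of the page is search and ranking".split()
    print(remove_stop_words(words))
```

What remains after filtering is the set of words that carry the topic of the page, which is also a smaller set to index.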
Preprocessing is something nobody sees, because it all happens in the search engine's background. It can be analyzed in nine stages, and I hope that after reading this, webmasters will have a general understanding of it. Because space is limited, today I will share only some of these stages. If anything here is wrong, please point it out.
Second, Chinese word segmentation: Google also performs word segmentation, but when people talk about segmentation they generally mean Chinese segmentation. For English, it is enough to split the text at the boundaries between words, whereas Chinese is much more complex. A Chinese search engine, Baidu in particular, has to take the habits of Chinese users into account, so it has its own approach to segmentation. In website optimization there is little we can do about segmentation: the only ways to tell a search engine that certain characters belong together as one word are to put them in bold or to use H tags.
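To make the difference concrete, here is a minimal sketch that contrasts the two cases. English is split on whitespace, while Chinese is segmented with forward maximum matching against a dictionary. Forward maximum matching is one simple textbook technique chosen for illustration; the text above does not say which algorithm Baidu or Google actually uses. The function name, the `max_len` parameter, and the sample dictionary are all assumptions made for this example.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Segment text by greedily taking the longest dictionary word.

    At each position, try the longest candidate first and shrink it
    until it is found in the dictionary. A single character is always
    accepted as a fallback so the loop makes progress.
    """
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


if __name__ == "__main__":
    # English: word boundaries are already marked by spaces.
    print("search engine optimization".split())

    # Chinese: no spaces, so a (hypothetical) dictionary is needed.
    sample_dictionary = {"搜索", "引擎", "搜索引擎", "优化"}
    print(forward_max_match("搜索引擎优化", sample_dictionary))
```

Greedy matching like this depends entirely on the dictionary, which is one reason segmentation quality differs between search engines.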
I believe this is familiar to all of us: many webmaster resources and other SEO data call it the "index". For a search engine, indexing is one of the most important steps, and it is directly related to both crawling and ranking. The pages a search engine crawls cannot be used for ranking directly. Because the amount of data on the Internet is huge, the engine cannot retrieve results from all web pages in real time when a user searches. Instead, it returns results to the user from its own database. This database has been processed in advance, which is where the term preprocessing comes from.
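The "process in advance, answer from the database" idea can be sketched with a toy inverted index, a structure that maps each term to the pages containing it. This is a common way to build such a database, used here as an assumption for illustration; the text above does not describe the engines' actual storage format. The names `build_index` and `search` and the sample pages are hypothetical.

```python
from collections import defaultdict


def build_index(pages):
    """Preprocessing step: map each term to the set of pages containing it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for term in text.lower().split():
            index[term].add(page_id)
    return index


def search(index, query):
    """Query step: return pages that contain every term in the query.

    Only the prebuilt index is consulted; the page texts are not
    scanned again at query time.
    """
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


if __name__ == "__main__":
    sample_pages = {
        "a.html": "search engine index",
        "b.html": "search ranking",
    }
    idx = build_index(sample_pages)
    print(search(idx, "search index"))
```

The expensive work happens once in `build_index`; each query then reduces to a few set lookups, which is why results can be returned quickly.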
Fourth, noise removal: you may not know what "noise" means here. On the Internet, noise refers to page elements that contribute nothing substantial to the topic of the page, such as large blocks of copyright text, navigation, and advertisements. The category and historical archive pages in many blogs
First, extract the text: search engines today are still text-based, so what matters on a web page is its text. A page also contains many things, including images, video, and JS, that cannot be used as ranking content. For a search engine, therefore, extracting the text of a web page is the first thing to do. Besides the ordinary body text, extraction also covers the text in Meta tags, the ALT text of images, and so on. Another source is anchor text, which plays a very important role in web page ranking.
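The extraction step described above can be sketched with Python's standard `html.parser` module. The class below collects visible text, Meta tag content, image ALT text, and anchor text, and skips the contents of script and style elements. The names `TextExtractor` and `extract` and the sample HTML are assumptions for this example, and a real search engine's extractor is certainly far more elaborate.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text sources mentioned above from an HTML page."""

    SKIP = {"script", "style"}  # contents are not usable as text

    def __init__(self):
        super().__init__()
        self.body_text = []    # visible text, in document order
        self.meta = {}         # meta name -> content
        self.alt_text = []     # ALT text of images
        self.anchor_text = []  # text inside <a> elements
        self._skip_depth = 0
        self._in_anchor = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag == "meta" and attrs.get("name") and attrs.get("content"):
            self.meta[attrs["name"].lower()] = attrs["content"]
        elif tag == "img" and attrs.get("alt"):
            self.alt_text.append(attrs["alt"])
        elif tag == "a":
            self._in_anchor = True

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1
        elif tag == "a":
            self._in_anchor = False

    def handle_data(self, data):
        text = data.strip()
        if not text or self._skip_depth:
            return
        self.body_text.append(text)
        if self._in_anchor:
            self.anchor_text.append(text)


def extract(html):
    """Parse an HTML string and return the populated extractor."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser


if __name__ == "__main__":
    sample = (
        '<html><head><meta name="description" content="SEO basics">'
        "<script>var x = 1;</script></head><body><p>Hello world</p>"
        '<img src="a.png" alt="A logo"><a href="/about">About us</a>'
        "</body></html>"
    )
    page = extract(sample)
    print(page.body_text, page.meta, page.alt_text, page.anchor_text)
```

Keeping anchor text in its own list reflects the point that it is treated as a separate, important signal rather than as ordinary body text.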