UnknownNovelist
Well-known member
Hi,
Before we dive into what it is I’m seeking guidance on, I feel that there is a need to set the context on what exactly this topic is about: ScribbleHub Explorer.
ScribbleHub Explorer is an AI-powered novel recommendation engine. The idea for this project arose from my own frustrations with hitting the disabled “next chapter” button on my favorite novel and struggling to find other stories that piqued my interest.
ScribbleHub Explorer was designed with three main goals in mind: to help aspiring authors attract more readers who would love their stories, to help readers find novels that suit their tastes, and to help ScribbleHub generate more traffic and ad revenue. As a website that offers free novels and a supportive community, ScribbleHub can only thrive by attracting more visitors, which in turn requires offering more novels that readers will love, creating a virtuous cycle that generates the ad revenue needed to keep the site going.
After nearly a month of coding, I'm proud to say that ScribbleHub Explorer is now almost out of alpha and capable of generating personalized recommendations based on each user's reading habits. The engine is powered by two recommendation systems. The first is a "pure" content-based recommendation system that uses cosine similarity to compare the synopsis, genres, tags, fandom, rating, favorites, readers, author, and more of each novel on ScribbleHub to find similar novels that a user might enjoy. This recommendation system is ideal for ScribbleHub users who don't have an account, have disabled their reading lists, or simply don't have any novels added to their reading lists.
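The content-based half can be illustrated with a minimal bag-of-words sketch. The toy data and token scheme below are my own illustration, not the actual pipeline; the real engine presumably applies TF-IDF or similar weighting over many more fields.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def features(novel: dict) -> Counter:
    # Mix synopsis words with prefixed genre/tag tokens so that shared
    # genres and tags pull novels together just like shared vocabulary.
    tokens = novel["synopsis"].lower().split()
    tokens += [f"genre:{g}" for g in novel["genres"]]
    tokens += [f"tag:{t}" for t in novel["tags"]]
    return Counter(tokens)

novels = [
    {"id": 1, "synopsis": "a dungeon core grows its dungeon",
     "genres": ["Fantasy"], "tags": ["Dungeon", "Non-Human Lead"]},
    {"id": 2, "synopsis": "a reincarnated hero clears the dungeon",
     "genres": ["Fantasy"], "tags": ["Dungeon"]},
    {"id": 3, "synopsis": "office romance in modern tokyo",
     "genres": ["Romance"], "tags": ["Slice of Life"]},
]

query = features(novels[0])
ranked = sorted(novels[1:], key=lambda n: cosine(query, features(n)),
                reverse=True)
print([n["id"] for n in ranked])  # → [2, 3]: the dungeon novel outranks the romance
```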
The second recommendation system is the "AI" behind it all. It uses user-item collaborative filtering and generates recommendations via alternating least squares (ALS), a matrix-factorization machine-learning algorithm, trained on implicit feedback data collected from user profiles. This system is particularly good at identifying which novels a user might like based on their reading habits.
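The collaborative half can be sketched in a few lines of numpy. This is plain, unweighted ALS on a toy 0/1 reading-list matrix; a production implicit-feedback system (e.g. the `implicit` library) would add per-interaction confidence weights, which this sketch omits.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy implicit-feedback matrix: rows = users, columns = novels,
# 1 = the novel appears on that user's reading list.
R = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 0, 1, 1],
              [0, 0, 1, 1]], dtype=float)

k, lam = 2, 0.1                       # latent factors, L2 regularization
U = rng.normal(scale=0.1, size=(R.shape[0], k))   # user factors
V = rng.normal(scale=0.1, size=(R.shape[1], k))   # item factors
I = np.eye(k)

for _ in range(20):
    # Alternate closed-form ridge-regression solves:
    # fix item factors, solve for users, then swap.
    U = R @ V @ np.linalg.inv(V.T @ V + lam * I)
    V = R.T @ U @ np.linalg.inv(U.T @ U + lam * I)

scores = U @ V.T                      # predicted affinity for every user/novel pair
user = 0
unseen = np.where(R[user] == 0)[0]    # novels user 0 hasn't read: 2 and 3
best = unseen[np.argmax(scores[user, unseen])]
# Novel 2 is read by user 1 (whose list overlaps user 0's), so it
# scores higher for user 0 than novel 3, which only cluster-two users read.
```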
In addition to these recommendation systems, ScribbleHub Explorer also includes a range of other features to enhance the user experience. These include a statistics page that can help authors identify the latest trends among readers, a Top 100 list that ranks novels using IMDb's weighted-rating formula for movie rankings, and more.
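For reference, the IMDb-style weighted rating blends a novel's own mean rating with the site-wide mean, so that a handful of five-star votes can't outrank a well-established story. The thresholds below (m, C) are illustrative values, not the ones the app actually uses.

```python
def weighted_rating(R: float, v: int, m: int, C: float) -> float:
    """IMDb-style Bayesian weighted rating.

    R: the novel's mean rating, v: its number of ratings,
    m: minimum ratings required to be listed, C: site-wide mean rating.
    """
    return (v / (v + m)) * R + (m / (v + m)) * C

# A 5.0-rated novel with 3 ratings should NOT outrank a 4.6 novel with
# 2,000 ratings (assuming m = 100 and a site-wide mean of 3.8).
niche = weighted_rating(R=5.0, v=3, m=100, C=3.8)       # ≈ 3.83
popular = weighted_rating(R=4.6, v=2000, m=100, C=3.8)  # ≈ 4.56
```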
The app's main page is the content-based recommendation part, which allows users to find similar novels by entering the novel ID or URL, without needing a ScribbleHub user ID.
Main Page
Getting a recommendation
The second page is the AI recommendation engine, which provides personalized recommendations based on a user's reading habits.
Personalized Recommendations
There is also a second tab to find similar novels based on what other users have added to their reading lists.
Similar Novels
Here's a view of the Top 100 list.
Top 100
And all recommendations can be filtered on genres and tags.
Filtering recommendations
Finally, the statistics page provides insights into which novels, genres, and tags are trending, and which are performing best.
Statistics
The entire album can be found here: Album
Now that you have an idea of what I've been working on, we can get into the heart of the matter.
As with all AI, data quality is essential. If the data fed to a model isn't meticulously cleaned and well structured, the resulting recommendations or classifications will be poor. To gather data on the reading habits of tens of thousands of users and the metadata of thousands of novels, I initially used a rudimentary browser-based plugin that could "scrape" ScribbleHub. However, the dataset it generated was not well formatted, and the plugin often glitched, particularly on larger reading lists. As a result, although the data was workable in principle, the metrics for evaluating the quality of the AI's recommendations (such as p@k, map@k, and ndcg@k) were not ideal. That, at least, is my self-critical assessment of the data-gathering process, drawing on my experience working with AI/ML algorithms.
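For readers unfamiliar with those metrics, here is a minimal sketch of two of them (binary relevance, toy data): precision@k is the fraction of the top-k recommendations the user actually liked, and NDCG@k additionally rewards putting hits near the top of the list.

```python
import math

def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommendations that are relevant."""
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / k

def ndcg_at_k(recommended, relevant, k):
    """Binary-relevance NDCG: hits higher up the list count more."""
    dcg = sum(1 / math.log2(i + 2)
              for i, item in enumerate(recommended[:k]) if item in relevant)
    ideal = sum(1 / math.log2(i + 2)
                for i in range(min(len(relevant), k)))
    return dcg / ideal if ideal else 0.0

# Toy example: novels 202 and 404 were on the user's held-out list.
recommended = [101, 202, 303, 404, 505]
relevant = {202, 404, 999}
p5 = precision_at_k(recommended, relevant, 5)   # 2 hits out of 5 → 0.4
n5 = ndcg_at_k(recommended, relevant, 5)        # < 1.0: hits aren't at the top
```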
Despite its limitations, the dataset served as a starting point, and the resulting recommendations you see in the screenshots above were generated. However, I was aware that the AI's suggestions could be improved. My goal was for users to be astonished by the recommendations, thinking not just "oh, that's nice," but rather "oh, damn! Why didn't I know about this novel before?" Thus, I began working on a web crawler.
For those unfamiliar with web crawlers, they are data-gathering tools akin to nukes. Those pirate sites that steal novels and offer them for free? They're the work of web crawlers.
Web crawlers are highly effective because they can make numerous concurrent requests to a server, "scrape" everything on a page, and move on to the next link. In theory, a nefarious individual could build a crawler that requests a new page every millisecond, hundreds of times concurrently, overwhelming the server with requests while scraping everything it has to offer. This resembles a DoS (denial-of-service) attack carried out by a single computer. If cloud machines were to do this in a distributed manner, it would be practically indistinguishable from a genuine DDoS attack.
As with any tool, web crawlers can be used for both good and nefarious purposes. The onus is on the designer to ensure that their implementation does not overload the server they are visiting with the bot. It's important to remember that Google, the world's largest web crawler, visits millions of websites daily and "steals" their content to index it into their search engine.
In summary, web crawling is a gray area. Using crawlers to steal content or bombard a server is wrong, but when used responsibly and with appropriate precautions, they can be a valuable tool for gathering data that could benefit the host site.
I felt it was important to discuss web crawlers in this post to provide insight into how they can be used to scrape a website.
With these considerations in mind, I created a plan to "crawl responsibly," and create an "ethical crawler," with the following key points:
- Only collect the publicly available data that is strictly required.
- For novels: only collect usage statistics from the "Stats" page, the title, synopsis, and novel metadata (such as rating, number of chapters, number of reviews, number of favorites, readers, and status), plus a link to the cover image. I did NOT and will not collect any chapter content or copyrighted material.
- For users: only collect usernames, join date, followers, following, number of comments made, and novels added to reading lists. I did NOT collect any private data, such as location or last activity, which could potentially be used to identify a user's real identity.
- Crawl ScribbleHub responsibly, using a high download delay and no concurrency (a single request every 2-5 seconds).
- Use auto-throttling to detect server load. If the server shows a spike in latency, which could indicate that it is under stress, dial back the scraping and use an even higher download delay.
- Identify myself in the User-Agent string, so that ScribbleHub can identify me when I crawl. The User-Agent string includes the name of the bot and my email so that ScribbleHub can contact me in case of any concerns. Do NOT disguise myself as a legitimate browser or pass on fake headers.
- Obey the robots.txt file found at www.scribblehub.com/robots.txt. Do NOT visit any disallowed paths.
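The points above map almost one-to-one onto Scrapy's built-in settings. The post never states which framework the spider uses, so treat this as a sketch of how such a polite configuration could look (the bot name and email are placeholders, not the real ones):

```python
# Sketch of Scrapy-style settings implementing the plan above.
POLITE_SETTINGS = {
    "ROBOTSTXT_OBEY": True,            # honor robots.txt before every request
    "CONCURRENT_REQUESTS": 1,          # never hit the site in parallel
    "DOWNLOAD_DELAY": 3.5,             # base delay; Scrapy's default
                                       # RANDOMIZE_DOWNLOAD_DELAY spreads this
                                       # over 0.5x-1.5x, i.e. roughly 2-5 s
    "AUTOTHROTTLE_ENABLED": True,      # back off automatically when latency rises
    "AUTOTHROTTLE_START_DELAY": 5,
    "AUTOTHROTTLE_MAX_DELAY": 60,
    "AUTOTHROTTLE_TARGET_CONCURRENCY": 1.0,
    # Identify the bot honestly and give the site a way to reach its operator.
    "USER_AGENT": "ScribbleHubExplorerBot/0.1 (contact: you@example.com)",
}
```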
My doubts about what I was doing first arose around user profile 6,000 when ScribbleHub sent back a 403 response – effectively banning me, albeit temporarily. I realized that I had generated a lot of traffic despite the measures I had put in place, and I had not contacted ScribbleHub about my activity. To them, it would appear as if a strange bot was crawling their entire user index, generating a ton of traffic. ScribbleHub has about 116,976 users, and I was making 2 requests to every user profile, resulting in 233,952 requests for which I was solely responsible. My scraping alone accounted for 4.25% of ScribbleHub's total monthly traffic of 5.5 million visitors, according to SimilarWeb.
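A back-of-the-envelope calculation, using the figures above, shows why a polite full crawl is such a slow, high-volume undertaking (3.5 s is simply the midpoint of the 2-5 s delay, my assumption):

```python
# How long does a polite full crawl of every user profile take?
users = 116_976
requests = users * 2                  # two requests per user profile
avg_delay = 3.5                       # seconds; midpoint of the 2-5 s delay
days = requests * avg_delay / 86_400  # 86,400 seconds per day
print(f"{requests:,} requests ≈ {days:.1f} days of continuous crawling")
```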
After realizing the potential inconvenience, I messaged Tony to explain the purpose of my activity and assure him that I had no plans to monetize the data collected. I hoped that this would be a one-time thing and that users would ultimately benefit from more story choices, driving up ScribbleHub's ad revenue. I stressed that future crawling would not be done on a large scale as the app would passively fetch the latest user information from ScribbleHub only upon explicit user request. I stressed that I would immediately cease and desist upon request.
With the message sent, I waited to see if I would get a reply and if I was still blocked. Around 48 hours later, with no reply, I resumed crawling.
My thoughts at that time were “this is a one-off thing. While an inconvenience now, it’s not going to happen again.”
And this is where I made my first mistake. Unknowingly, I had made a typo in the XPath (a query language for selecting nodes in an HTML document). At first glance, the data the spider returned looked OK; only on a second read-through would you see the errors.
JSON:
{"novels": [
    {"title": "=÷Horizon÷=", "novel_href": "https://www.scribblehub.com/series/173577/horizon/", "rating": 3, "progress": "41/73", "novel_id": 173577},
    {"title": "A Bond Beyond Blood", "novel_href": "https://www.scribblehub.com/series/173577/horizon/", "rating": 3, "progress": "41/73", "novel_id": 173577},
    …
The spider appeared to have crawled each user's novels correctly, but on closer examination I discovered that only the title was unique: the novel_href and the novel_id had been duplicated across all novels in the scraped reading list.
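In hindsight, a tiny sanity check run during the crawl would have caught this symptom immediately. The helper below is hypothetical, not part of the original spider; it simply flags a reading list whose distinct titles all share one novel_id:

```python
def reading_list_looks_corrupt(novels: list) -> bool:
    """Flag a scraped reading list whose distinct titles all share a
    single novel_id, i.e. the duplication symptom shown above."""
    titles = {n["title"] for n in novels}
    ids = {n["novel_id"] for n in novels}
    return len(titles) > 1 and len(ids) == 1

# The corrupted output from the JSON sample above:
scraped = [
    {"title": "=÷Horizon÷=", "novel_id": 173577},
    {"title": "A Bond Beyond Blood", "novel_id": 173577},
]
print(reading_list_looks_corrupt(scraped))  # → True
```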
I felt sheer indignation, anger, self-blame, and frustration when I discovered this, and I was devastated to find that it was all due to a missing asterisk "*" in the XPath. This was compounded by the fact that I only discovered the error after I had crawled user number 71,188.
After cooling my head, I tried to salvage the data. With a left-join merge on the title against the novels I had scraped earlier, I managed to salvage 41,012 users' reading lists, but this still left some 30,000 users who needed to be re-crawled. Not to mention that I would lose the ratings the salvaged users had given and their reading progress, data I had planned to use to upgrade the AI from implicit feedback to rating-based recommendations. Moreover, re-crawling this many users is not something to be done lightly. ScribbleHub's server had also just sent another 403 error code back to the spider, indicating that it was becoming less tolerant of my scraping behavior (and this was with all the precautions in place to ensure I wasn't burdening the server).
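The salvage step can be sketched like this. It is a hypothetical reconstruction of the title join, not the actual code, and the second novel's id and URL are placeholders: each corrupted entry is rebuilt from the clean novel metadata scraped earlier, titles with no match are dropped, and ratings/progress cannot be recovered because they were duplicated too.

```python
# title -> clean metadata from the earlier novel crawl
# (the second entry's id/href are made-up placeholders)
novel_lookup = {
    "=÷Horizon÷=": {
        "novel_id": 173577,
        "novel_href": "https://www.scribblehub.com/series/173577/horizon/",
    },
    "A Bond Beyond Blood": {
        "novel_id": 999999,
        "novel_href": "https://www.scribblehub.com/series/999999/placeholder/",
    },
}

def salvage(reading_list):
    """Left-join a corrupted reading list on title; keep only matches.

    Ratings and progress are intentionally discarded: in the corrupted
    data they were duplicated across rows, so they can't be trusted.
    """
    fixed = []
    for entry in reading_list:
        clean = novel_lookup.get(entry["title"])
        if clean is None:
            continue          # title never seen in the novel crawl: unsalvageable
        fixed.append({"title": entry["title"], **clean})
    return fixed

corrupted = [
    {"title": "=÷Horizon÷=", "novel_id": 173577, "rating": 3},
    {"title": "A Bond Beyond Blood", "novel_id": 173577, "rating": 3},
    {"title": "Some Unknown Novel", "novel_id": 173577, "rating": 3},
]
repaired = salvage(corrupted)   # 2 of 3 entries recovered, ids fixed
```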
Needless to say, I was in a dark place and briefly toyed with the idea of abandoning all my principles and going full dark side. I could disguise my bot under different random user-agents, send faked headers to the server, and use a rotating proxy or VPN to hide my IP. And I admit, I did briefly test out some implementations.
Sometime later that same evening, I had a sobering thought: "What am I doing?"
I needed to weigh the potential benefits against the harm my actions could cause to ScribbleHub and its users. I’m not trying to do something nefarious – and I don’t want to do something nefarious. This is a hobby project, where my key goals were to “help aspiring authors get more readers who love their stories, to enable readers to find a good story that suits their taste and to help ScribbleHub generate more traffic and ad revenue.”
So now we arrive at the dilemma I'm facing, and I hope we can discuss it openly and come up with a solution that benefits everyone. I would like to ask for your collective wisdom on how to proceed:
Should I re-crawl all users, despite the extra traffic it may incur on the servers? I would appreciate Tony's input on this matter and will adhere to any guidelines he dictates.
If I don't re-crawl, I could use the dataset I have already collected, but the AI would not be as good at generating personalized recommendations. Furthermore, I would need to crawl novel statistics regularly to keep the novel information up to date.
I would understand if there are concerns about my project, and I welcome any discussion or feedback. Is what I'm making dead in the water? Is it something that the community wants? I want to emphasize that my intention is to help the community and not to cause any harm.
If you’ve stuck with me this long, then thanks for taking the time to read my post, and I look forward to hearing your thoughts.
/Unknown Novelist
TL;DR: I created an app that generates AI-powered personalized recommendations. However, this requires me to re-scrape all users on ScribbleHub, which will generate a lot of traffic, an undue expense for the site. Furthermore, to keep the novel information updated, I would need to collect data about once a month. Is this still a good app?