Now Live! ScribbleHub Explorer – An AI-Powered Novel Recommendation Engine: Feedback Needed!

UnknownNovelist

Well-known member
Joined
Oct 25, 2019
Messages
57
Points
58
Hi,

Before we dive into what I'm seeking guidance on, I should set the context for what this topic is about: ScribbleHub Explorer.

ScribbleHub Explorer is an AI-powered novel recommendation engine. The idea for this project arose from my own frustrations with hitting the disabled "next chapter" button on my favorite novel and struggling to find other stories that piqued my interest.

ScribbleHub Explorer was designed with three main goals in mind: to help aspiring authors attract more readers who would love their stories, to help readers find novels that suit their tastes, and to help ScribbleHub generate more traffic and ad revenue. As a website that offers free novels and a supportive community, ScribbleHub can only thrive by attracting more visitors, which in turn requires offering more novels that readers will love; that virtuous cycle generates the ad revenue that keeps the site going.

After nearly a month of coding, I'm proud to say that ScribbleHub Explorer is now almost out of alpha and capable of generating personalized recommendations based on each user's reading habits. The engine is powered by two recommendation systems. The first is a "pure" content-based recommendation system that uses cosine similarity to compare the synopsis, genres, tags, fandom, rating, favorites, readers, author, and more of each novel on ScribbleHub to find similar novels that a user might enjoy. This recommendation system is ideal for ScribbleHub users who don't have an account, have disabled their reading lists, or simply don't have any novels added to their reading lists.
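For the technically curious, here is a rough sketch of how a content-based recommender like this can be built. This is an illustration rather than my exact code; the column names and the toy novels below are made up.

Python:
# Combine each novel's text metadata into one "soup", vectorize it with TF-IDF,
# and rank other novels by cosine similarity. The dataframe contents are placeholders.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

novels = pd.DataFrame({
    "novel_id": [101, 102, 103],
    "title":    ["Novel A", "Novel B", "Novel C"],
    "synopsis": ["A fallen noble rebuilds ...", "Two rivals bound by ...", "A colony ship ..."],
    "genres":   ["Fantasy Adventure", "Fantasy Romance", "Sci-fi"],
    "tags":     ["Kingdom Building", "Vampires", "Spaceships"],
})

# One text document per novel; genres and tags could be repeated to weight them higher.
soup = novels["synopsis"] + " " + novels["genres"] + " " + novels["tags"]
tfidf_matrix = TfidfVectorizer(stop_words="english").fit_transform(soup)

def similar_novels(novel_id: int, top_n: int = 10) -> pd.DataFrame:
    """Return the top_n novels most similar to the given novel."""
    idx = novels.index[novels["novel_id"] == novel_id][0]
    scores = cosine_similarity(tfidf_matrix[idx], tfidf_matrix).ravel()
    order = scores.argsort()[::-1][1:top_n + 1]  # drop the novel itself
    return novels.iloc[order].assign(similarity=scores[order])

print(similar_novels(101, top_n=2))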

The second recommendation system is the "AI" behind it all. It uses user-item collaborative filtering and generates recommendations via alternating least squares (ALS), a matrix factorization algorithm, trained on implicit feedback data collected from user profiles. This system is particularly good at identifying which novels a user might like based on their reading habits.
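Here is a minimal sketch of that part, using the open-source implicit library's ALS implementation. This is not my exact code: the toy matrix stands in for the real reading-list data, and the call signatures shown are those of implicit 0.5+.

Python:
# Collaborative filtering via alternating least squares on implicit feedback.
# Rows are users, columns are novels, and a value of 1 means "on this user's reading list".
import numpy as np
import scipy.sparse as sparse
from implicit.als import AlternatingLeastSquares

user_idx  = np.array([0, 0, 1, 2])      # toy data: which user ...
novel_idx = np.array([10, 42, 42, 7])   # ... has which novel on their reading list
values = np.ones(len(user_idx))
user_items = sparse.csr_matrix((values, (user_idx, novel_idx)), shape=(3, 100))

model = AlternatingLeastSquares(factors=64, regularization=0.05, iterations=20)
model.fit(user_items)

# Top 10 novels for user 0, excluding what they already have on their list.
ids, scores = model.recommend(0, user_items[0], N=10, filter_already_liked_items=True)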

In addition to these recommendation systems, ScribbleHub Explorer also includes a range of other features to enhance the user experience. These include a statistics page that can help authors identify the latest trends among readers, a Top 100 list that ranks novels using the weighted rating formula IMDb uses for its movie charts, and more.
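For reference, that weighted rating looks like this; my exact choice of the minimum-ratings threshold m may differ from the example values below.

Python:
# IMDb-style weighted rating: WR = v/(v+m) * R + m/(v+m) * C
#   R = the novel's average rating, v = its number of ratings,
#   m = minimum number of ratings required to chart, C = mean rating across all novels.
def weighted_rating(R: float, v: int, m: int, C: float) -> float:
    return (v / (v + m)) * R + (m / (v + m)) * C

# A 4.8-rated novel with only 20 ratings charts below a 4.5-rated novel with 500 ratings
# (assuming m = 100 and a site-wide mean rating of 4.0).
print(weighted_rating(4.8, 20, 100, 4.0))   # ~4.13
print(weighted_rating(4.5, 500, 100, 4.0))  # ~4.42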

The app's main page is the content-based recommendation part, which allows users to find similar novels by entering the novel ID or URL, without needing a ScribbleHub user ID.
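Normalizing that input is just a matter of pulling the numeric ID out of the series URL. A small sketch; the URL pattern matches ScribbleHub's series links (e.g. the one shown in the JSON further down).

Python:
# Accept either a bare novel ID ("173577") or a full series URL and return the numeric ID.
import re

def parse_novel_id(value: str) -> int:
    match = re.search(r"/series/(\d+)", value)
    if match:
        return int(match.group(1))
    return int(value.strip())

print(parse_novel_id("https://www.scribblehub.com/series/173577/horizon/"))  # 173577
print(parse_novel_id("173577"))                                              # 173577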


Main Page

Getting a recommendation

The second page is the AI recommendation engine, which provides personalized recommendations based on a user's reading habits.

Personalized Recommendations

There is also a second tab to find similar novels based on what other users have added to their reading lists.

Similar Novels

Here's a view of the Top 100 list.

Top 100

And all recommendations can be filtered on genres and tags.

Filtering recommendations

Finally, the statistics page provides insights into which novels, genres, and tags are trending, and which are performing best.

Statistics

The entire album can be found here: Album

Now that you have an idea of what I've been working on, we can get into the heart of the matter.

As with all AI, data quality is essential. If the data fed to an AI isn't meticulously cleaned and well-structured, the resulting recommendations or classifications will be of poor quality. To gather data on the reading habits of tens of thousands of users and the metadata of thousands of novels, I initially used a rudimentary browser plugin that could "scrape" ScribbleHub. However, the dataset it generated was not well formatted, and the plugin often glitched, particularly on larger reading lists. As a result, although it was workable in principle, the metrics for evaluating the quality of the AI's recommendations (such as p@k, map@k, and ndcg@k) were not ideal. That, at least, is my self-critical assessment of the data-gathering process, based on my experience working with AI/ML algorithms.
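As a quick illustration of what those metrics measure, precision@k is the simplest of the three; the IDs below are made up.

Python:
# precision@k: what fraction of the top-k recommendations actually appear in the
# user's held-out interactions (the "relevant" set) from the test split.
def precision_at_k(recommended: list[int], relevant: set[int], k: int) -> float:
    hits = sum(1 for novel_id in recommended[:k] if novel_id in relevant)
    return hits / k

# 2 of the top 5 recommended novel IDs were in the user's held-out reading list -> 0.4
print(precision_at_k([11, 52, 3, 97, 40], {52, 40, 8}, k=5))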

Despite its limitations, the dataset served as a starting point and produced the recommendations you see in the screenshots above. However, I was aware that the AI's suggestions could be improved. My goal was for users to be astonished by the recommendations, thinking not just "oh, that's nice," but rather "oh, damn! Why didn't I know about this novel before?" Thus, I began working on a web crawler.

For those unfamiliar with web crawlers, they are data-gathering tools akin to nukes. Those pirate sites that steal novels and offer them for free? They're the work of web crawlers.

Web crawlers are highly effective because they can make numerous concurrent requests to a server, "scrape" everything on each page, and move on to the next link. In theory, a nefarious individual could build a crawler that requests a new page every millisecond, hundreds of requests at a time, overwhelming the server while scraping everything it has to offer. That is essentially a DoS (Denial of Service) attack carried out by a single computer; run from cloud machines in a distributed manner, it would be practically a full-blown DDoS attack.

As with any tool, web crawlers can be used for both good and nefarious purposes. The onus is on the designer to ensure that their implementation does not overload the server they are visiting with the bot. It's important to remember that Google, the world's largest web crawler, visits millions of websites daily and "steals" their content to index it into their search engine.

In summary, web crawlers occupy a gray area when it comes to usage. It's wrong to use them to steal content and bombard a server, but when used responsibly and with appropriate precautions, they can be a valuable tool for gathering data that could benefit the host site.


I felt it was important to discuss web crawlers in this post to provide insight into how they can be used to scrape a website.


With these considerations in mind, I created a plan to "crawl responsibly" and build an "ethical crawler," with the following key points (a settings sketch follows the list):
  • Only collect the publicly available data that is strictly required.
  • For novels: only collect usage statistics from the "Stats" page, title, synopsis, and novel meta-data (such as rating, number of chapters, number of reviews, number of favorites, readers, and status), and link to the image. I did NOT and will not collect any chapter content or copyrighted material.
  • For Users: only collect usernames, joined date, followers, following, number of comments made, and novels added to reading lists. I did NOT collect any private data such as location, last activity, etc. which could potentially be used to identify users’ real identity.
  • Responsibly crawl ScribbleHub using a high download delay (i.e., do not make a lot of requests in succession) and do not use a high number of concurrent requests (only 1 request every 2-5 seconds).
  • Use auto-throttling to detect server load. If the server shows a spike in latency, which could indicate that it is under stress, dial back the scraping, use a higher download delay, and be more patient.
  • Identify myself in the User-Agent string, so that ScribbleHub can identify me when I crawl. The User-Agent string includes the name of the bot and my email so that ScribbleHub can contact me in case of any concerns. Do NOT disguise myself as a legitimate browser or pass on fake headers.
  • Obey the robots.txt document found at www.scribblehub.com/robots.txt. Do NOT visit any disallowed sites.
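To make the list above concrete, here is roughly what those points look like as crawler settings. I'm expressing them in Scrapy terms, since that framework's settings map one-to-one onto the points above; treat the exact values, bot name, and email as placeholders rather than my actual configuration.

Python:
# settings.py - "crawl responsibly" translated into crawler settings (placeholder values).
BOT_NAME = "scribblehub_explorer"

# Identify myself honestly so ScribbleHub can see who is crawling and contact me.
USER_AGENT = "ScribbleHubExplorerBot/0.1 (+contact: my-email@example.com)"

# Obey www.scribblehub.com/robots.txt and never visit disallowed paths.
ROBOTSTXT_OBEY = True

# One request at a time, roughly every 2-5 seconds (the delay is randomized).
CONCURRENT_REQUESTS = 1
DOWNLOAD_DELAY = 3
RANDOMIZE_DOWNLOAD_DELAY = True

# Back off automatically if the server's latency spikes (i.e. it may be under stress).
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 3
AUTOTHROTTLE_MAX_DELAY = 30
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0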
After carefully considering the technical requirements, I proceeded to design the spider (another term for a web crawler). To my surprise, what had previously taken me several days of manual crawling through the novels took only a day or so with the spider. Not only was it faster, it also produced complete and error-free data. Naturally, I was thrilled and quickly turned my attention to building a spider that could do the same for users and their reading lists, a crucial part of training the AI. After a few attempts and some verification, I had a spider that collected the necessary data, and I sent it off to crawl through the user profiles on ScribbleHub.

My doubts about what I was doing first arose around user profile 6,000 when ScribbleHub sent back a 403 response – effectively banning me, albeit temporarily. I realized that I had generated a lot of traffic despite the measures I had put in place, and I had not contacted ScribbleHub about my activity. To them, it would appear as if a strange bot was crawling their entire user index, generating a ton of traffic. ScribbleHub has about 116,976 users, and I was making 2 requests to every user profile, resulting in 233,952 requests for which I was solely responsible. My scraping alone accounted for 4.25% of ScribbleHub's total monthly traffic of 5.5 million visitors, according to SimilarWeb.

After realizing the potential inconvenience, I messaged Tony to explain the purpose of my activity and assure him that I had no plans to monetize the data collected. I hoped that this would be a one-time thing and that users would ultimately benefit from more story choices, driving up ScribbleHub's ad revenue. I stressed that future crawling would not be done on a large scale, as the app would passively fetch the latest user information from ScribbleHub only upon explicit user request, and I made clear that I would immediately cease and desist upon request.

With the message sent, I waited to see if I would get a reply and if I was still blocked. Around 48 hours later, with no reply, I resumed crawling.

My thoughts at that time were “this is a one-off thing. While an inconvenience now, it’s not going to happen again.”

And this is where I made my first mistake. Unknowingly, I had made a typo in the XPath (an expression language for selecting nodes in an HTML document). At first glance, the data the spider gave me looked OK; only on a second read-through would you see the errors.

JSON:
“{"novels": [{"title": "=÷Horizon÷=", "novel_href": "https://www.scribblehub.com/series/173577/horizon/", "rating": 3, "progress": "41/73", "novel_id": 173577}, {"title": "A Bond Beyond Blood", "novel_href": "https://www.scribblehub.com/series/173577/horizon/", "rating": 3, "progress": "41/73", "novel_id": 173577}, …”

It appears that the spider had crawled the user's novels correctly, but on closer examination, I discovered that only the title was unique, and the novel_href and the novel_id had become duplicated across all novels in the scraped reading list.

I felt sheer indignation, anger, self-blame, and frustration when I discovered this, and I was devastated to find out that it was all due to a missing asterisk "*" in the XPath. This was compounded by the fact that I only discovered the error after I had crawled user number 71,188.
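I can't reconstruct the exact expression here, but to illustrate the class of bug: inside a per-row loop, a selector that isn't properly scoped to the current row will happily return the same node for every row, which is exactly the duplicated novel_href/novel_id pattern above. A hypothetical example follows (the markup and the second href are made up; my actual typo was the missing asterisk, but the symptom is the same).

Python:
# Hypothetical reading-list markup. The badly scoped selector always picks up the
# first entry's link, so every row ends up with the same href.
from parsel import Selector

html = """
<div class="reading-list">
  <div class="entry"><a href="/series/173577/horizon/">=÷Horizon÷=</a></div>
  <div class="entry"><a href="/series/000000/a-bond-beyond-blood/">A Bond Beyond Blood</a></div>
</div>
"""
sel = Selector(text=html)

for entry in sel.xpath('//div[@class="entry"]'):
    title = entry.xpath('.//a/text()').get()
    bad_href = entry.xpath('//a/@href').get()    # BUG: searches the whole document
    good_href = entry.xpath('.//a/@href').get()  # OK: stays inside the current entry
    print(title, bad_href, good_href)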

After cooling my head, I tried to salvage the data. With a left-join merge on the title against the novels I had scraped earlier, I managed to salvage 41,012 users' reading lists, but that still left some 30,000 users who needed to be re-crawled. Not to mention that I would lose the salvaged users' ratings and reading progress, something I had planned to use to upgrade the AI from implicit feedback to rating-based recommendations. Moreover, re-crawling that many users is not something to be done lightly. ScribbleHub's server had also just sent another 403 back to the spider, indicating that it was becoming less tolerant of my scraping (and this was with all the precautions in place to ensure I wasn't burdening the server in any way).
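For reference, the salvage step was essentially a left join on the title in pandas; a rough sketch, with file and column names that are illustrative rather than my actual ones.

Python:
# Recover the correct novel_id/novel_href by joining the broken reading-list rows
# against the novel metadata scraped earlier, matching on title.
import pandas as pd

reading_lists = pd.read_json("scraped_reading_lists.json")  # rows with duplicated ids
novel_meta = pd.read_csv("novels.csv")                       # correct id/href per title

salvaged = (
    reading_lists
    .drop(columns=["novel_id", "novel_href"])                # discard the corrupted fields
    .merge(novel_meta[["title", "novel_id", "novel_href"]], on="title", how="left")
)

# Titles that matched nothing (or are ambiguous) can't be trusted and must be re-crawled.
unresolved = salvaged["novel_id"].isna().sum()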

Needless to say, I was in a dark place and briefly toyed with the idea of abandoning all my principles and going full dark side. I could disguise my bot under different random user-agents, send faked headers to the server, and use a rotating proxy or VPN to hide my IP. And I admit, I did briefly test out some implementations.

Sometime later that same evening, I had a sobering thought: "What am I doing?"

I needed to weigh the potential benefits against the harm my actions could cause to ScribbleHub and its users. I’m not trying to do something nefarious – and I don’t want to do something nefarious. This is a hobby project, where my key goals were to “help aspiring authors get more readers who love their stories, to enable readers to find a good story that suits their taste and to help ScribbleHub generate more traffic and ad revenue.”

So now we arrive at the dilemma I'm facing, and I hope we can discuss this openly and come up with a solution that benefits everyone. At the same time, I would like to ask for your collective wisdom on how to proceed:

Should I re-crawl all users, despite the extra traffic it would put on the servers? I would appreciate Tony's input on this matter and will adhere to any guidelines he dictates.

If I don't re-crawl, I could use the dataset I have already collected, but the AI would not be as good at generating personalized recommendations. Furthermore, I would need to crawl the novels' statistics regularly to keep the novel information up to date.

I would understand if there are concerns about my project, and I welcome any discussion or feedback. Is what I'm making dead in the water? Is it something that the community wants? I want to emphasize that my intention is to help the community and not to cause any harm.

If you’ve stuck with me this long, then thanks for taking the time to read my post, and I look forward to hearing your thoughts.

/Unknown Novelist

TL;DR: I created an app that generates AI-powered personalized recommendations. This, however, requires me to re-scrape all users on ScribbleHub, which will generate a lot of traffic, an undue expense for the site. Furthermore, to keep the novel information updated, I would need to collect data about once a month. Is this still a good app?
 
Last edited:

Corty

Sneaking in, stealing your socks.
Joined
Oct 7, 2022
Messages
2,377
Points
128
As my tech-savviness ends at modding Fallout 4 and doing texture swapping at max, I am amazed by the work you did yet understand very little of it; I only grasp that the core problem is that your actions were considered malicious, or close to being identified as a DDoS attempt. I'll watch on with curiosity what others' responses will be, as it sounds interesting, but as far as I know, only a few people can give you the green light.

edit: checking the images and seeing the number of completed novels in contrast to the "ongoing but mostly abandoned anyway" ones... makes me sad.
 

Zirrboy

Fueled by anger
Joined
Jan 25, 2021
Messages
1,145
Points
153
If server load is your concern, I'd try to work out an API solution with Tony, or leave it be. If you can pull the relevant data in batches without the HTML bloat (perhaps even filtering for recent changes), the stress should become minimal.

As for the design itself, I fear that the model evaluating users will fit towards the already present exposure channels.
Of course there'll be an otherwise hidden gem here and there, but if the majority of the reading list data represents findings through trending, latest chapter and perhaps search, optimizing for that will yield similar results.

There are tons of novels published as of now, but just filtering for them still being active and having a certain number of chapters available, alongside some genre tags, brings the options down to reasonable numbers.

My last point is one of philosophy. I like SH's relatively powerful search engine over most others, but the better the tool to preselect, the quicker you'll have found - and read through - the stories that interest you the most, from which point on it goes downhill.
If you read all the best suited at the start, the average to decent works the site may have to offer will seem bland afterwards, and you know there's little chance the tool might have missed something you'll be able to look forward to.
 

TotallyHuman

Well-known member
Joined
Feb 13, 2019
Messages
4,155
Points
183
I feel like the current system of searching through filters is enough.
 

UnknownNovelist

Well-known member
Joined
Oct 25, 2019
Messages
57
Points
58
As my tech-savviness ends at modding Fallout 4 and doing texture swapping at max, I am amazed by the work you did yet understand very little of it; I only grasp that the core problem is that your actions were considered malicious, or close to being identified as a DDoS attempt. I'll watch on with curiosity what others' responses will be, as it sounds interesting, but as far as I know, only a few people can give you the green light.

edit: checking the images and seeing the number of completed novels in contrast to the "ongoing but mostly abandoned anyway" ones... makes me sad.
Thanks for your interest. I have plans in the roadmap to add more filters, including status, so that you can view ongoing novels.
 

UnknownNovelist

Well-known member
Joined
Oct 25, 2019
Messages
57
Points
58
I feel like the current system of searching through filters is enough.
Thanks for your input. It's also a valid point. No need to generate a lot of traffic if no-one is going to use it.

If server load is your concern, I'd try to work out an API solution with Tony, or leave it be. If you can pull the relevant data in batches without the HTML bloat (perhaps even filtering for recent changes), the stress should become minimal.
Yes, that would be ideal, but I don't know if Tony is willing to go through the hassle. I, for one, would welcome it.
As for the design itself, I fear that the model evaluating users will fit towards the already present exposure channels.
Of course there'll be an otherwise hidden gem here and there, but if the majority of the reading list data represents findings through trending, latest chapter and perhaps search, optimizing for that will yield similar results.
Not exactly. While trending, latest chapters, etc. do have their merits, they require you to spend the time and effort browsing them. I mean, who hasn't scrolled endlessly through the top rankings in hopes of finding something? The AI would be better at spotting specific tastes by analyzing what you like and then recommending what you haven't seen.
There are tons of novels published as of now, but just filtering for them still being active and having a certain number of chapters available, alongside some genre tags, brings the options down to reasonable numbers.
Indeed - I was also surprised by the number of novels with no chapters, or only 1-2 chapters before going on hiatus (around 45% of all novels). For the AI to work properly, novels and readers each need to meet a minimum threshold; for now I've found that 5 chapters per novel and 5 novels per reading list is a good starting point. But again, I can't draw any conclusions from the poor dataset I have at the moment; it needs a proper grid search of the hyper-parameters across different scenarios (see the sketch below).
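To be concrete about what I mean by a grid search: sweep the ALS hyper-parameters (and the minimum-activity thresholds) and score each combination on a held-out split. A bare-bones sketch with toy data, using the evaluation helpers that ship with the implicit library; this is not my actual script.

Python:
# Sweep ALS hyper-parameters and keep the combination with the best precision@10
# on a held-out split. The random matrix stands in for the real reading-list data.
from itertools import product
import scipy.sparse as sparse
from implicit.als import AlternatingLeastSquares
from implicit.evaluation import train_test_split, precision_at_k

interactions = sparse.random(500, 2000, density=0.01, format="csr", random_state=0)
train, test = train_test_split(interactions, train_percentage=0.8)

best = None
for factors, reg in product([32, 64, 128], [0.01, 0.05, 0.1]):
    model = AlternatingLeastSquares(factors=factors, regularization=reg, iterations=15)
    model.fit(train, show_progress=False)
    score = precision_at_k(model, train, test, K=10, show_progress=False)
    if best is None or score > best[0]:
        best = (score, factors, reg)

print(best)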
My last point is one of philosophy. I like SH's relatively powerful search engine over most others, but the better the tool to preselect, the quicker you'll have found - and read through - the stories that interest you the most, from which point on it goes downhill.
If you read all the best suited at the start, the average to decent works the site may have to offer will seem bland afterwards, and you know there's little chance the tool might have missed something you'll be able to look forward to.
The search function on SH is good, but only if you know the specific title. You do, of course, also have the Series Finder, but again, that requires you to know which genres, tags, etc. you want and to input them manually.
 

KrakenRiderEmma

Well-known member
Joined
Jan 27, 2023
Messages
225
Points
78
Here's how I would summarize your situation:
  • There's interest in your project

  • You have enough data to launch an imperfect version -- you have almost 60% of the user data you wanted. Any first version of something like this will be imperfect; it's just that usually you don't know what all the problems are. In your case, you do know (the recommendations could be 60% more accurate!) You would probably get more actionable feedback from early users, to motivate you towards whatever the v1.1 version is

  • Tony probably doesn't have time to respond to you or work with you on this, but you have been unbanned
Therefore, you should go ahead and launch a version with 60% of the user data, and get some real user reactions, feedback, etc. See how well it works. You can make tweaks and changes to UI or whatever other aspects you need to, and then prior to launching the next version, write a note to Tony saying that you are going to re-crawl to get all the user data. In that (brief) message you can describe the user feedback so far (I assume it will mostly be good, if it's not then you might not want to continue!) and that will also demonstrate how the 5% temporary increase in traffic is worthwhile.

To do this long-term you probably would have to re-spider periodically, right -- so to show Tony & any other admins that it is worth it, you have to demonstrate that it's good, that people want it, that it works. Once you do that, the occasional 5% traffic spike will be no big deal, and maybe they'd even get around to doing API support for you (among a ton of other stuff I'm sure is on their list)
 

melchi

What is a custom title?
Joined
May 2, 2021
Messages
1,890
Points
153
I'm not sure how it is superior to the search function already there.

A "give me recommendations" feature might be nice but having a secondary account would not be. Does tony have a REPO that could be forked / merged to just add it into the main javascript or whatever is used?
 

UnknownNovelist

Well-known member
Joined
Oct 25, 2019
Messages
57
Points
58
I'm not sure how it is superior to the search function already there.

A "give me recommendations" feature might be nice but having a secondary account would not be. Does tony have a REPO that could be forked / merged to just add it into the main javascript or whatever is used?
It uses your current ScribbleHub profile - no need for a secondary account. You only need to have your profile and reading list enabled for public viewing (which is enabled by default).
 

Zirrboy

Fueled by anger
Joined
Jan 25, 2021
Messages
1,145
Points
153
Yes that would be ideal, but I don't know if Tony is willing to go to such hassle. I would, for one, welcome it.
It should be similar in structure to the existing interfaces, but my point is that if you care about crawling and its load impact in proportion to how much time you spent elaborating on it, I'd say take no response from Tony as a no. Cloudflare and similar measures (that then get turned off again because they cause issues) don't make me think that large-scale machine access would be a wanted thing, given the lack of any statement to the contrary.

Not exactly. While trending, latest chapters, etc. do have their merits. It requires you to spend the effort and time in doing so. I mean, who hasn't scrolled through the top rankings endlessly in hopes to find something? The AI would be better at spotting specific tastes by analyzing what you like and then recommending what you haven't seen.
Based on data that is largely influenced by said algorithms. Plus, the ones willing to download an app for this one site would in my mind be mostly active users, who check in often enough to catch most ongoing novels as is, and probably have dug through searches (referring to the series finder).

This isn't an issue with the approach to use AI itself, even just the fuzzy synopsis search could help with novels that are swept under the rug due to the hard nature of the finder, and you've probably thought about this a lot more than me, but the AI models I know are unbiased towards the possible causes for the behaviors they replicate.

If for example your recommendation was done by training the AI to recommend me my reading list, and on use you simply take the next best offers after those, the results will likely mirror the most common exposure channels, since those are the ways I end up finding the novels that make it to my reading list.

Which isn't something I think more data alone will be able to remedy, but all I have to do is wait and see, I guess.

While the search function on SH is good, but that's only if you know the specific title. You, ofcourse, also have the Series Finder, but again, that requires you to know what genres, tags, etc. you want and input that manually.
I use the term search for the series finder. With how many novels you're able to ignore due to being discontinued I think that suffices for many purposes, as even without many genre restrictions the scale of options becomes manageable.

But my point there was supposed to be that a more powerful recommender has the downside of "speedrunning" to the best/most relevant content, meaning that once you're through with that the finds will only get worse.

With a really good novel once in a while I can still enjoy the decent to average ones in between those finds, but with the best back to back upfront the rest will feel worse to read than they would have been.
Which is something I think the finder is already capable of to some extent, so an adaptive network will likely hit worse.
 
Joined
Feb 6, 2021
Messages
2,327
Points
153
this is fantastic work. i can imagine how difficult it must've been to create.
i have a few concerns though. correct me if I'm wrong but this works solely on reading lists? if so then it will be severely inaccurate.
not everything in reading lists is read. most are added on a whim and never touched again. any prediction made from such data will prove pointless.

are you planning on deploying it soon? I'd like to see how well it works and put my concerns to rest
 

UnknownNovelist

Well-known member
Joined
Oct 25, 2019
Messages
57
Points
58
It should be similar in structure to the existing interfaces, but my point is that if you care about crawling and its load impact in proportion to how much time you spent elaborating on it, I'd say take no response from Tony as a no. Cloudflare and similar measures (that then get turned off again because they cause issues) don't make me think that large-scale machine access would be a wanted thing, given the lack of any statement to the contrary.
I just got a reply from Tony where he asked me to stop. So, the project is dead.
Based on data that is largely influenced by said algorithms. Plus, the ones willing to download an app for this one site would in my mind be mostly active users, who check in often enough to catch most ongoing novels as is, and probably have dug through searches (referring to the series finder).
I think you misunderstood. It's not an app. Just a site built with Python.
This isn't an issue with the approach to use AI itself, even just the fuzzy synopsis search could help with novels that are swept under the rug due to the hard nature of the finder, and you've probably thought about this a lot more than me, but the AI models I know are unbiased towards the possible causes for the behaviors they replicate.

If for example your recommendation was done by training the AI to recommend me my reading list, and on use you simply take the next best offers after those, the results will likely mirror the most common exposure channels, since those are the ways I end up finding the novels that make it to my reading list.

Which isn't something I think more data alone will be able to remedy, but all I have to do is wait and see, I guess.
It's true that if, for example, you only have the most common novels on your reading list, it's harder to find something beyond what everyone has in common that would suit your tastes. But this is where a larger number of data points can be used to find clusters, and also why good data matters, so that the hyper-parameters can be fine-tuned for these cases. While the recommendations are based on reading lists, I'm not saying the system is flawless; several downsides do exist. For example, if you have only added a small number of novels, the AI is less likely to find a good recommendation. The reverse is also true: if you have added anything and everything, it's difficult to categorize you. The ideal scenario is a reader who has curated their primary reading list with the novels they really like, which the AI can use to find other novels that fit that cluster.
I use the term search for the series finder. With how many novels you're able to ignore due to being discontinued I think that suffices for many purposes, as even without many genre restrictions the scale of options becomes manageable.

But my point there was supposed to be that a more powerful recommender has the downside of "speedrunning" to the best/most relevant content, meaning that once you're through with that the finds will only get worse.
That's true. If you make it through the recommendation list, then you'll eventually get "recommendations" that are less to your taste. For this, I added a UI element to visualize how likely it is that you would like a given novel. Nothing is infinite, not even the internet.
With a really good novel once in a while I can still enjoy the decent to average ones in between those finds, but with the best back to back upfront the rest will feel worse to read than they would have been.
Which is something I think the finder is already capable of to some extent, so an adaptive network will likely hit worse.
Maybe? I don't disagree completely with you, and I can genuinely see your point of view - as getting the best novels back-to-back could be worse. However, I also feel that using an AI to discover what you might have missed is a valid use-case. Something that might have been accidentally filtered out through the Series Finder or otherwise.

I really appreciate your feedback and I'll take it to heart. But it's unlikely that anything will come of it as Tony has formally asked me to stop scraping. So unless something changes in the future, I don't think the site will be a thing.
this is fantastic work. i can imagine how difficult it must've been to create.
i have a few concerns though. correct me if I'm wrong but this works solely on reading lists? if so then it will be severely inaccurate.
not everything in reading lists is read. most are added on a whim and never touched again. any prediction made from such data will prove pointless.
That's indeed correct. If you have added anything and everything to your reading list, then there's not as high a chance that the AI will be able to find a good recommendation, or at worst, it'll be something completely random.

Edit: I had planned to use the "reading progress" information available in each user's reading list to understand whether they had actually read a novel, and to only factor in novels in which they had read some chapters. Alternatively, I could also have used users' ratings of novels, and only factor rated novels into the recommendation algorithm. (A rough sketch follows below.)
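Roughly what that would have looked like; the field names mirror the JSON I showed earlier, the second novel_id is a placeholder, and the threshold is just an example.

Python:
# Turn the "read/total" progress string into a weight: drop novels the user barely
# touched, and use the fraction read as the confidence value for the implicit model.
def progress_weight(progress: str, min_chapters_read: int = 3) -> float:
    read, total = (int(x) for x in progress.split("/"))
    if total == 0 or read < min_chapters_read:
        return 0.0
    return read / total

entries = [
    {"novel_id": 173577, "progress": "41/73", "rating": 3},
    {"novel_id": 111111, "progress": "0/120", "rating": None},  # added on a whim, never read
]
weights = {e["novel_id"]: progress_weight(e["progress"]) for e in entries}
# -> {173577: 0.56..., 111111: 0.0}; zero-weight entries are excluded from training.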
are you planning on deploying it soon? I'd like to see how well it works and put my concerns to rest
I'm planning on deploying it, but it will not be updated, as Tony has asked me to stop scraping.

Edit: To expand on my last comment: the project is *dead*, and I will not be scraping a new dataset to further train the AI. In short, I'm back to square one and I don't know if I can continue the project.
 
Last edited: