Hi everyone!
I'm back with an update:
@Tony has given the OK to post the link to the web app I made!
You can find ScribbleHub Explorer here: https://sh-explorer.streamlit.app/
Please note that this is still an early beta, so expect crashes, bugs, and inconsistent behavior. The dataset is from the end of January/early February, so it does not include the latest information. Also note that the AI recommendations are not at their best, because the dataset the model was trained on is poor.
Since @Tony asked me to stop scraping, I've been thinking of ways to get the required novel data and reading lists without scraping from ScribbleHub. Then it struck me that I don't have to scrape ScribbleHub, since someone else is already doing so at ScribbleHub's request - Google! Every time Google visits ScribbleHub (which is quite often), it saves a copy of the page it visited and makes it available as a cached page. This means I can simply adapt what I've made to crawl Google's cached pages instead of ScribbleHub.
The next problem is regarding users' reading lists. As I mentioned in my opening post, reading lists are fundamental in terms of training the AI. Although I have some rudimentary data that I can use, it is not optimal. Since I can no longer scrape ScribbleHub, I've thought of a few ways to get around this:
- When users want a recommendation, they'll first have to enter their reading list. For users with many novels in the reading list, this might not be sustainable, so I thought of making it possible to copy-paste the reading list's RSS feed into an input box. However, the RSS feed is limited to 25 items, which means that novels that update less regularly won't be included.
- Copy-paste the raw text of the reading list. The advantage here is that you'll get everything in the reading list, but it is also more error-prone. What if users do not copy-paste correctly? There are also novels that share the same title, so a title alone does not always map 1-to-1 to a single novel.
- Copy-paste the page source. This would be a better option, but it's also dependent on the user's skill. Not many people would know how to correctly view the page source of their reading list.
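To illustrate the duplicate-title problem with the raw-text option, here is a minimal Python sketch. It assumes the pasted text has one title per line and that the app already holds a novel index as `(novel_id, title)` pairs; the function names and the index shape are my own placeholders, not something the app currently has.

```python
from collections import defaultdict


def normalize(title: str) -> str:
    """Casefold and collapse whitespace so small copy-paste differences still match."""
    return " ".join(title.casefold().split())


def match_titles(pasted_text, novel_index):
    """Match pasted titles against a novel index.

    novel_index is a list of (novel_id, title) pairs (hypothetical shape).
    Returns (matched, ambiguous, unknown):
      matched   - pasted title -> the single novel id it maps to
      ambiguous - pasted title -> all novel ids sharing that title
      unknown   - pasted titles not found in the index
    """
    by_title = defaultdict(list)
    for novel_id, title in novel_index:
        by_title[normalize(title)].append(novel_id)

    matched, ambiguous, unknown = {}, {}, []
    for line in pasted_text.splitlines():
        key = normalize(line)
        if not key:
            continue  # skip blank lines from a sloppy copy-paste
        ids = by_title.get(key, [])
        if len(ids) == 1:
            matched[line.strip()] = ids[0]
        elif len(ids) > 1:
            ambiguous[line.strip()] = ids
        else:
            unknown.append(line.strip())
    return matched, ambiguous, unknown
```

Titles that land in `ambiguous` or `unknown` would still need to be resolved by the user, for example by picking the right novel from a short list.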
Since I cannot directly retrieve the user's reading list, I'll also have to figure out a way to save the input data, which means opening up a whole other can of worms. I'll need to figure out user registration, how to save users' reading lists to a database, how users can sync or update their reading lists with ScribbleHub, and so on.
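As a rough idea of what the storage side could look like, here is a minimal sketch using Python's built-in `sqlite3` module. The table layout, the `user_name` key, and the function names are all assumptions for illustration; it leaves out registration and authentication entirely.

```python
import sqlite3


def open_db(path=":memory:"):
    """Open (or create) the database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS reading_list (
               user_name TEXT NOT NULL,
               novel_id  INTEGER NOT NULL,
               PRIMARY KEY (user_name, novel_id)
           )"""
    )
    return conn


def sync_reading_list(conn, user_name, novel_ids):
    """Replace the stored list for one user with the newly submitted one."""
    with conn:  # commits on success, rolls back on error
        conn.execute("DELETE FROM reading_list WHERE user_name = ?", (user_name,))
        conn.executemany(
            "INSERT OR IGNORE INTO reading_list (user_name, novel_id) VALUES (?, ?)",
            [(user_name, novel_id) for novel_id in novel_ids],
        )


def load_reading_list(conn, user_name):
    """Return the stored novel ids for one user, sorted ascending."""
    rows = conn.execute(
        "SELECT novel_id FROM reading_list WHERE user_name = ? ORDER BY novel_id",
        (user_name,),
    )
    return [row[0] for row in rows]
```

Replacing the whole list on every submission keeps "sync" simple: whatever the user pastes last is treated as the current state of their reading list.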
A far better option would be if ScribbleHub could offer an RSS feed of a user's entire reading list (not of the latest chapter releases, but of which novels are on the reading list). This would make it easy to retrieve a user's current reading list.
Another way would be to get approval from Tony/ScribbleHub to access users' reading lists and retrieve the necessary data upon user request. This would be different from the broad web-scraping that I did before, as it would only request data from the ScribbleHub servers when users requested it.
The ideal scenario would be for Tony/ScribbleHub to make an API available - one where you could access a novel index, novel information, public users, and public reading lists. Of course, such an API should be protected with an API key. I don't know if this is at all feasible, and it would depend solely on Tony.
Please let me know what you think, and feel free to give me some feedback on the web app!
Edit: Changed the post title, as I'm now looking for feedback :)