Spike solution problem specification
What sites to crawl?
- Google (search on "computer science internships near me" or whatever)
- NSF REU Site Directory
It's important to scrape as many of these sites as we can. If we only scrape one or two, students might wonder what the benefit of InternBit actually is: they could just do that manually themselves. A significant value-add of InternBit comes from scraping as many internship databases as possible, so students are spared the time of trying all of the different ones. Even if one site largely duplicates the others, that might change in the future. And even if a site has just a couple of non-duplicate internships, those might be the good ones for at least some students.
What search parameters?
This is a complex question, since the answer depends upon two factors. First, each student has different skills and interests, which affects the parameter values: a student interested in "data science" will want to search for that, while a student interested in "machine learning" will want to search for something different. Second, each site has its own query language, which affects the way the search is done: searching in LinkedIn will be different from searching in Indeed.com or StudentOpportunityCenter.
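To make the two factors concrete, here is a minimal sketch of how a single student interest could be mapped onto per-site search URLs. The URL formats below are assumptions about each site's query syntax, not verified specifications; a real spike solution should check them against the live sites.

```javascript
// Hypothetical sketch: map one student interest onto per-site search URLs.
// Each site gets its own query template, reflecting its own "query language".
function buildSearchUrl(site, interest) {
  const q = encodeURIComponent(interest);
  switch (site) {
    case "linkedin":
      // Assumed LinkedIn jobs-search URL format.
      return `https://www.linkedin.com/jobs/search/?keywords=${q}`;
    case "indeed":
      // Assumed Indeed search URL format.
      return `https://www.indeed.com/jobs?q=${q}`;
    default:
      throw new Error(`No query template for site: ${site}`);
  }
}
```

A per-site template table like this keeps the student-specific part (the interest string) separate from the site-specific part (the query syntax), so adding a new site means adding one more case.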
Finally, what student should you use as your "test subject"? I suggest, for starters, use yourself! Look at your RadGrad profile, extract your skills and interests from it, and then use that to drive your scraping queries. You can use your own interests as an informal test of the efficiency of the process: can you build a scraper that returns results as good as you would have gotten by scraping the site by yourself?
Are credentials required?
It might be that searching a site effectively requires the student to log in (and potentially set up a profile). Can InternBit, given the student's credentials, log in automatically, potentially set up any profile information, and then do the search?
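One way to explore this question is with a headless browser. The sketch below assumes a page object with `goto`/`type`/`click` methods (the shape of the Puppeteer page API); the site descriptor with its login URL and CSS selectors is entirely hypothetical and would have to be worked out per site.

```javascript
// Hypothetical sketch of an automated login, assuming a headless-browser
// "page" object (goto/type/click, as in the Puppeteer API). The site
// descriptor fields below are made up for illustration.
async function loginTo(page, site, credentials) {
  await page.goto(site.loginUrl);                          // open the login form
  await page.type(site.usernameSelector, credentials.username);
  await page.type(site.passwordSelector, credentials.password);
  await page.click(site.submitSelector);                   // submit the form
}
```

Passing the page object in (rather than creating the browser inside the function) also makes the login flow testable with a stub, which is handy in a spike solution.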
Is there an API?
Some sites (Chegg and Indeed?) provide a REST API which avoids the need to scrape the HTML of the site to extract the information. If an API is available for a site, definitely use it instead of scraping!
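Querying an API typically looks like the sketch below. The endpoint and response shape here are invented placeholders, not any real site's API; consult each site's actual API documentation. The fetch function is injectable so the sketch can be exercised without network access.

```javascript
// Hypothetical sketch: query a REST API instead of scraping HTML.
// The endpoint is a made-up placeholder; fetchFn defaults to the
// global fetch (Node 18+) but can be stubbed for testing.
async function fetchInternships(query, fetchFn = globalThis.fetch) {
  const url = `https://api.example.com/internships?q=${encodeURIComponent(query)}`;
  const response = await fetchFn(url);
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  return response.json(); // assumed to be an array of internship records
}
```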
There are two ways to approach the implementation of spike solutions. One approach is to pick a scraper, say Osmosis, and then see how many sites you can crawl with it. A second approach is to pick the two or three most important sites (say, LinkedIn, Indeed, and Google), and then figure out which scraper works best on them. You can pick whichever way appeals to you more. (Overlap between developers is fine.)
You will create a new GitHub repository to hold each spike solution.
Each repository will have the following naming structure: internbit-scraper-(initials)-(scraper or sites). Let's say you are Aubrie and you want to play around with the Osmosis library. Then your repo would be named "internbit-scraper-au-osmosis". If you are Jenny and you want to explore StudentOpportunityCenter and NSF REU, then your repo would be named "internbit-scraper-jh-soc-reu". (Yes, abbreviations are permitted to prevent crazy long names.)
Invoke as an npm script
Each system will be implemented as an npm script. So, running `npm run scraper` will invoke the scraper script in the package.json file, which will run your system. You should have a JSON file that contains parameters. If you need to supply credentials, you could read them as command-line arguments to avoid putting them into a JSON file that would be committed to GitHub. For example, something like `npm run scraper -- philipmjohnson foo` might log me in to LinkedIn, if my username was philipmjohnson and my password was foo, which it isn't.
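To sketch one way this could be wired up (the script name `scraper`, the file name `scraper.js`, and the positional argument order are all assumptions): package.json would contain `"scripts": { "scraper": "node scraper.js" }`, and the script could read credentials from the command line like this:

```javascript
// Hypothetical sketch: read credentials from command-line arguments so they
// are never committed to GitHub. Invoked as: npm run scraper -- <username> <password>
function parseCredentials(argv) {
  // argv looks like ["node", "scraper.js", username, password]
  const [, , username, password] = argv;
  if (!username || !password) {
    throw new Error("Usage: npm run scraper -- <username> <password>");
  }
  return { username, password };
}
```

Note the `--` in the invocation: it is how npm passes the remaining arguments through to the script rather than interpreting them itself.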
You may want to google "write npm scripts" to find all sorts of tutorials.