|
If I'm scraping a web page, I'm worried about network performance, not regex performance.
Short of being on a SAN (which there'd be no reason to scrape except for legacy integration), the network I/O will overshadow any potential regex performance issues by a large margin.
So I'm not worried about that.
Adding: if it really became an issue, I could switch over to a non-backtracking engine like the one I wrote in C#.
When I was growin' up, I was the smartest kid I knew. Maybe that was just because I didn't know that many kids. All I know is now I feel the opposite.
modified 20-Sep-19 8:50am.
|
I think there's a huge variation in performance between different regex engines.
Yes, it's never going to be lightning fast because of all the backward/forward matching going on, but it can be a damned sight quicker than .NET would make it seem!
Whenever you find yourself on the side of the majority, it is time to pause and reflect. - Mark Twain
|
Ah, cool, so someone has done it before, just not quite the same way.
I think my solution is simpler. They're munging the objects using Python.
I want to make it so you can define JSON objects with just the regex. It will use nested group captures to build the JSON hierarchy.
Different enough to satisfy me that it's worth it.
Thanks for the link.
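To make the nested-group-capture idea concrete, here's a minimal sketch of my own (the input format, class, and method names are all invented for illustration) using .NET's `CaptureCollection`, which keeps every capture of a repeated group rather than just the last one:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NestedCaptureDemo
{
    // Each repeated (?<key>...)=(?<value>...) pair inside a person{...} block
    // becomes one field of that object, giving a two-level JSON-like hierarchy.
    public static List<Dictionary<string, string>> Parse(string input)
    {
        var re = new Regex(@"person\{(?:(?<key>\w+)=(?<value>\w+);?)+\}");
        var result = new List<Dictionary<string, string>>();
        foreach (Match m in re.Matches(input))
        {
            var obj = new Dictionary<string, string>();
            var keys = m.Groups["key"].Captures;
            var values = m.Groups["value"].Captures;
            for (int i = 0; i < keys.Count; i++)
                obj[keys[i].Value] = values[i].Value; // captures stay index-aligned
            result.Add(obj);
        }
        return result;
    }

    public static void Main()
    {
        var people = Parse("person{name=Alice;age=30}person{name=Bob;age=25}");
        Console.WriteLine(people[0]["name"]); // Alice
        Console.WriteLine(people[1]["age"]);  // 25
    }
}
```

Going deeper than two levels is where it gets hairy, since each extra nesting level multiplies the capture bookkeeping.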
|
Can it go arbitrarily deep?
And remember that .NET's regular expression engine is far richer than most, so you may not be describing a generally applicable technique.
I've been working at loading data from JSON files into SQL Server for about a year now. I convert JSON to XML on the fly and pass the XML elements to SQL Server for further processing and storage, using SQL Server's built-in XML functions.
In my situation, it's all about getting data from file to table as quickly as possible (with an eye toward not hogging resources), and I never need to have all the objects in memory at once.
As in your case, I/O seems to be the main bottleneck, with writing to the database being slower than reading from the disk.
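For what it's worth, the flat-object case of that JSON-to-XML hop can be sketched in a few lines with System.Text.Json and LINQ to XML. This is my own illustration, not the poster's actual converter, and it ignores nesting, arrays, and escaping concerns:

```csharp
using System;
using System.Text.Json;
using System.Xml.Linq;

public static class JsonToXml
{
    // Convert one flat JSON object into a single XML element,
    // with one child element per JSON property.
    public static XElement ToXml(string json, string rootName)
    {
        var root = new XElement(rootName);
        using (var doc = JsonDocument.Parse(json))
        {
            foreach (var prop in doc.RootElement.EnumerateObject())
                root.Add(new XElement(prop.Name, prop.Value.ToString()));
        }
        return root;
    }

    public static void Main()
    {
        var row = ToXml("{\"id\":1,\"name\":\"widget\"}", "row");
        Console.WriteLine(row.ToString(SaveOptions.DisableFormatting));
        // <row><id>1</id><name>widget</name></row>
    }
}
```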
|
I'm targeting .NET, so I'm not worried about it. If I were to port it to anything, it would be something that at least used PCRE, which is about as rich as .NET's regex, IIRC.
But yeah, I'm looking at going arbitrarily deep. If I can't do it using nested group captures, I'll do it by allowing you to define a pseudo-JSON document where each of the values is a regex instead of an actual value.
|
What problem does this solve? Judging by posts on freelance work websites, I am sure people are scraping websites just fine already.
"It is easy to decipher extraterrestrial signals after deciphering Javascript and VB6 themselves.", ISanti
|
It would present a JSON-based facade you can apply over a traditional, non-web-service website, so you can basically front the website with a JSON REST service using some regex.
So say I declare some regex captures over Wikipedia so I can scrape encyclopedia information with search queries.
That site is then exposed as a REST service that I can use as though it was designed for that.
That's the basic idea anyway. I'm simplifying here as much as I can; in truth it might not scale to complexity. I'm still toying with the idea.
One of the things you can do with it is expose it to the client browsers so they can query that way, or you can handle it using JSON processing on the server, or whatever; maybe insert it into a MongoDB or something. There are lots of potential use cases out there, I'd think.
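As a rough sketch of the facade idea (entirely my own illustration; the convention that named captures become JSON field names is an assumption, and in the real thing the HTML would come from an HTTP GET of the page):

```csharp
using System;
using System.Text.RegularExpressions;

public static class ScrapeFacade
{
    // Hypothetical convention: the named captures in the "endpoint" regex
    // become the field names of the JSON object the facade returns.
    public static string ExtractAsJson(string html)
    {
        var m = Regex.Match(html, "<h1[^>]*>(?<title>[^<]+)</h1>");
        return "{ \"title\": \"" + m.Groups["title"].Value + "\" }";
    }

    public static void Main()
    {
        var html = "<h1 id=\"firstHeading\">Regular expression</h1><p>...</p>";
        Console.WriteLine(ExtractAsJson(html));
        // { "title": "Regular expression" }
    }
}
```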
|
The question is... what workload are you trying to circumvent here?
It sounds like all you are doing is providing the result in JSON format, as opposed to, say, a class object if you were using a .NET library to do the job. That isn't particularly useful, as it's already a very simple task to implement.
|
It would be part of the functionality of that JSON project I linked to, so you are actually getting back class objects.
They're basically dictionary classes that support dynamic call sourcing ("expando" objects) and expose the fields as properties.
Currently, to do a REST call, you write something like:
var args = new JsonObject();
args.Add("api_key", "c83a68923b7fe1d18733e8776bba59bb");
dynamic json = JsonRpc.Invoke("https://api.themoviedb.org/3/tv/2919", args);
Console.WriteLine(json.name);
Then you can do JSONPath queries, standard dictionary gets, or dynamic calls (like above) on the object you get back. It's a dictionary with extra methods.
Or you can use any dictionary and get to the methods through static methods on JsonObject.
There are lots of manipulation and querying features available, plus features to help you create arrays of entities out of JSON arrays, etc.
This is the core library that I may add this support to.
So I'd add another way to do invocations, using scraping, and return the objects in the same manner.
|
So, scrape a website and expose it as JSON to others. I am not aware of anyone who would need it. If I need to scrape something, I need the latest content from the page, not something historic from a database. I can scrape anything I want because I know exactly what I want. I would not spend time on another service and try to make it understand what I want.
Now, creating a JSON-formatted string from, let's assume, a string array is something I may need. But then again, my JSON schema is going to be more or less unique to me.
If I care about historic pages at all, the Wayback Machine is my good friend.
And if none of this makes sense, there are already a large number of services that offer website scraping. I found this through a quick search: Parsers - Free web scraper (https://parsers.me/).
|
You make several good points.
lw@zi wrote: Now, creating a JSON formatted string from, lets assume, a string array is something I may need. But then again, my JSON schema is going to be more or less unique to me.
Yeah, I made sure you could do that and back again with my JsonArray class. =) It even has overloads for KeyValuePair generation, since that's so common.
|
I have not. I'll check it out.
|
It's too easy.
|
Did you try Volapük and Hamlet?
I never got past the beginner crossword, and I take pride in that.
|
No, I just clicked on Tutorial, then Experienced.
|
Well, duh, of course the tutorial is easy.
I'm pretty sure Hamlet will give you a proper challenge (but don't blame me if that takes up the rest of your day).
|
I was looking at this problem, which is essentially a game known as Colonel Blotto.
Each general must allocate a given number of troops across a given number of platoons, which then battle against each other, with results decided by numerical supremacy. To win, you need both to win the greatest number of battles and to have the highest number of remaining troops.
So what we have is a zero-sum game with twin objectives (somewhat like hi-lo poker variants, but with a requirement to win both hands) and the additional complication of sequencing (a la rock-paper-scissors): if I set my troops out as [19, 1, 19, 1, 19, 1, 19, 1, 19, 1], I might get a very different result than I would sending out [1, 19, 1, 19, 1, 19, 1, 19, 1, 19], even though I'm using an identical strategy.
All of which gets me wondering: how well could we get a machine to play this game against a human opponent?
For the first element, calculating the optimal troop split, we could simply brute-force every possible combination, play them against each other, and see what wins. Given the vast number of combinations involved, it's obviously a huge task, but it's clearly doable. We're essentially going to wind up with a ranked list of possible distributions. Let's assume that the results of these calculations will be revealed/knowable to the human opponent.
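At toy scale the brute force is easy to sketch. The real case of splitting 100 troops across 10 platoons has over four trillion ordered splits, so this illustration (my own, with invented names) shrinks it to 12 troops and 3 platoons:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class BlottoBruteForce
{
    // Every ordered way to split `troops` across `platoons`.
    public static IEnumerable<int[]> Splits(int troops, int platoons)
    {
        if (platoons == 1) { yield return new[] { troops }; yield break; }
        for (int first = 0; first <= troops; first++)
            foreach (var rest in Splits(troops - first, platoons - 1))
                yield return new[] { first }.Concat(rest).ToArray();
    }

    // Battles won when distribution `a` meets `b` positionally.
    public static int Wins(int[] a, int[] b) =>
        a.Zip(b, (x, y) => x > y ? 1 : 0).Sum();

    public static void Main()
    {
        // Toy scale: 12 troops, 3 platoons -> 91 ordered splits.
        var all = Splits(12, 3).ToList();
        Console.WriteLine(all.Count); // 91

        // Rank each split by how many others it strictly beats on battles won.
        var ranked = all
            .OrderByDescending(a => all.Count(b => Wins(a, b) > Wins(b, a)))
            .ToList();
        Console.WriteLine(string.Join(",", ranked[0]));
    }
}
```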
Now the human general has an advantage: he knows that the computer knows that the best distribution is distribution X, and can easily counter. Suppose the calculated "optimal" distribution is actually 10 platoons of 10, and the human knows that the computer will play the "best" distribution; the human simply plays 9 platoons of 11 and 1 of 0, and crushes him every time.
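Under one plausible reading of the rules (the winner of each platoon battle keeps all its troops and the loser's are wiped out; the names and scoring here are my own), the 9-platoons-of-11 counter really does crush the uniform split:

```csharp
using System;
using System.Linq;

public static class BlottoScore
{
    // One plausible reading: platoons meet positionally, the larger side wins
    // the battle and keeps its troops, the smaller side is annihilated.
    public static (int wins, int remaining) Score(int[] mine, int[] theirs)
    {
        int wins = 0, remaining = 0;
        for (int i = 0; i < mine.Length; i++)
            if (mine[i] > theirs[i]) { wins++; remaining += mine[i]; }
        return (wins, remaining);
    }

    public static void Main()
    {
        var computer = Enumerable.Repeat(10, 10).ToArray();       // the "optimal" 10x10
        var human = Enumerable.Repeat(11, 9).Append(0).ToArray(); // 9x11 plus one empty platoon

        var h = Score(human, computer);
        var c = Score(computer, human);
        Console.WriteLine($"human: {h.wins} battles won, {h.remaining} troops left");    // 9, 99
        Console.WriteLine($"computer: {c.wins} battles won, {c.remaining} troops left"); // 1, 10
    }
}
```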
So the computer needs to throw a few curve balls. The easiest way to do this is to pick a random distribution from the top n distributions. The computer could further obfuscate by applying random increments and decrements to various pairs within the distribution, or whatever.
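That curve-ball step is a pick plus a jitter. Again, this is my own sketch; `ranked` is assumed to be the brute-forced list, best first:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class Curveball
{
    // Pick one of the top n ranked distributions at random, then move one
    // troop between a random pair of platoons to obfuscate further.
    public static int[] Pick(IReadOnlyList<int[]> ranked, int n, Random rng)
    {
        var d = (int[])ranked[rng.Next(Math.Min(n, ranked.Count))].Clone();
        int from = rng.Next(d.Length), to = rng.Next(d.Length);
        if (from != to && d[from] > 0) { d[from]--; d[to]++; }
        return d; // total troop count is unchanged
    }

    public static void Main()
    {
        var ranked = new List<int[]> { new[] { 10, 10, 10 }, new[] { 12, 9, 9 } };
        var play = Pick(ranked, 2, new Random());
        Console.WriteLine(string.Join(",", play) + "  (sum " + play.Sum() + ")"); // sum stays 30
    }
}
```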
Similarly, when it comes to ordering the platoons, the computer's most effective approach might just be to randomise the whole distribution array; maybe it can work a more intricate algorithm based on previous distributions employed by the human general.
The human, of course, has all the same curve balls at his disposal.
Machine learning is definitely on the menu, and time is no object. All previous games are stored and available to both players for analysis. The human has all the necessary database skills and tools to view the data as meaningfully as possible.
The human general is so fond of Colonel Blotto and proud to be human that he is happy to play against the machine a million times over if necessary in order to prove that humans are smarter than tin.
Will he win?
|
PeejayAdams wrote: The human general is so fond of Colonel Blotto and proud to be human that he is happy to play against the machine a million times over if necessary in order to prove that humans are smarter than tin.
Will he win?
If the human is like me, he has one huge advantage:
a bloody big hammer!
(Well, two advantages if you include my strength ... of rage.)
|
Fair point, I'll rephrase the question:
Will there come a point where the human general has to resort to extreme violence, or to praying that the cleaner unplugs the computer?
|
PeejayAdams wrote: human general has to resort to extreme violence
Every human on earth has the propensity for evil and violence.
So yes, the human general has to resort to extreme violence; it is his nature.
|
Slacker007 wrote: Every human on earth has the propensity for evil
The true question here is what kind of evil: chaotic evil, neutral evil, or lawful evil.
I have lived with several Zen masters - all of them were cats.
His last invention was an evil Lasagna. It didn't kill anyone, and it actually tasted pretty good.
|
Well, evil is relative with regard to "type", but the only constant is that evil is the result of negative energy. So does it really matter what kind it is? No, of course not.
|
If there were no random elements to the machine's strategy, then I would be tempted to say the human could win over time, since if they were able to identify the machine's algorithm, they could potentially predict future moves.
However, given the random element, the human ultimately has no idea what the computer will play. Therefore, in order to have a chance of winning, the human must also play random guesses. So I believe the outcome will ultimately be based on luck alone, and could unpredictably go either way.
Having said that, the psychological element of the human could affect the true 'randomness' of their placements. For example, they may subconsciously implement a pattern to their randomness without realising it, and if the machine has been implemented to pick up on that, it could steal the upper hand. So in this scenario (where the human also plays random), I would probably choose the machine.
But... the wiki link suggests that there are strategies to avoid losing. In which case the human can simply play that strategy every time, and because they know the machine won't play it, the human will always either draw or win. If that strategy is permitted by the rules, though (i.e. playing the same distribution each game), then the machine would be stupid not to also play that same strategy every game. So all games would be a draw, and therefore no winner.
So, bottom line, I think it really depends on how prepared the human is before starting, and very much on how the machine is programmed to play.
|